Note:
Dataset - 1 = 22 features
['price', 'room_bed', 'room_bath', 'living_measure', 'lot_measure',
'ceil', 'coast', 'sight', 'condition', 'quality', 'ceil_measure',
'basement', 'yr_built', 'living_measure15', 'lot_measure15',
'furnished', 'total_area', 'month_year', 'City', 'has_basement',
'HouseLandRatio', 'has_renovated']
Dataset - 2 = 31 features (important features retained after creating dummy variables and analyzing different models)
['price', 'room_bed', 'room_bath', 'living_measure', 'lot_measure',
'ceil', 'sight', 'condition', 'ceil_measure', 'basement', 'yr_built',
'yr_renovated', 'zipcode', 'lat', 'long', 'living_measure15',
'lot_measure15', 'total_area', 'coast_1', 'quality_3', 'quality_4',
'quality_5', 'quality_6', 'quality_7', 'quality_8', 'quality_9',
'quality_10', 'quality_11', 'quality_12', 'quality_13', 'furnished_1']
Below setup is needed before running this notebook:
1. Add the file USA ZipCodes_1.xlsx to your current working directory to access this data
2. Add the folder WA to your current working directory
3. Install the below 2 libraries
conda install -c conda-forge/label/cf201901 geopandas
conda install -c conda-forge/label/cf201901 shapely
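A quick way to verify both installs succeeded before running the rest of the notebook (a minimal sketch; `importlib` only checks that the packages are importable):

```python
import importlib.util

# Sanity check: confirm the geospatial packages installed above are importable.
for pkg in ("geopandas", "shapely"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'OK' if found else 'MISSING - install from conda-forge as shown above'}")
```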
This Jupyter Notebook is done as part of the PGPML Great Learning Programme Capstone Project. Let's first define the problem and the objective of this exercise.
The problem statement is well defined in the given document, as follows:
A house's value is more than just location and square footage. Like the features that make up a person, an educated party would want to know all the aspects that give a house its value. For example, if we want to sell a house, we don't know what price to ask, as it can't be too low or too high. To estimate a house price we usually look for similar properties in the neighbourhood and, based on the collected data, assess our own house's price.
When any person or business wants to sell or buy a house, they face this issue: they don't know what price to offer, and so may offer too little or too much for the property. Therefore, we can analyze the available data on properties in the area and predict the price. We need to find how these attributes influence house prices. Right pricing is a very important aspect of selling a house, so it is important to understand which factors influence the house price and how. The objective is to predict the right price of the house based on its attributes.
We will build a model which predicts the house price when the required features are passed to it.
As people often don't know which features/aspects make up a property's price, we can also provide house buying/selling guidance services in the area, so that people can buy or sell their property at the most suitable price and neither lose their hard-earned money by pricing too low nor keep waiting for buyers by pricing too high.
First, we will load the data from the given CSV (comma-separated values) file provided as part of the Capstone Project.
# loading the library required for data loading and processing
import pandas as pd
import numpy as np
#Supress warnings
import warnings
warnings.filterwarnings('ignore')
# read the data using pandas function from 'innercity.csv' file
house_df = pd.read_csv('innercity.csv')
# let's check whether data loaded successfully or not, by checking first few records
house_df.head()
Data is loaded successfully as we can see first 5 records from the dataset.
After loading data into our pandas library dataframe, we can now try to understand the kind of data we have with us.
# print the number of records and features/aspects we have in the provided file
house_df.shape
We have more than 21k records with 23 features.
# let's check out the columns/features we have in the dataset
house_df.columns
From the above we can see the different columns we have in dataset.
These columns provide below information
# let's see the data types of the features
house_df.info()
In the dataset, we have more than 21k records and 23 columns; the data type of each column is shown above.
# let's check whether our dataset have any null/missing values
house_df.isnull().sum()
We don't have any null or missing values for any of the columns
# let's check whether there's any duplicate record in our dataset or not. If present, we have to remove them
house_df.duplicated().sum()
We don't have any duplicate records in our dataset, so we can say we have more than 21k unique records.
# let's look at the five-number summary statistics of the features
house_df.describe().transpose()
From the above analysis we learned:
Most columns' distributions are right-skewed; only a few features are left-skewed (like room_bath, yr_built, lat).
The columns that are categorical in nature are: coast, yr_renovated, furnished.
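These categorical-looking columns can be marked explicitly as the pandas `category` dtype so that downstream plots and groupbys treat them as discrete levels rather than numbers. A minimal sketch on a toy frame standing in for `house_df` (the values are illustrative):

```python
import pandas as pd

# Toy frame standing in for house_df (values are illustrative)
df = pd.DataFrame({
    "coast":     [0, 1, 0, 0],
    "furnished": [0, 0, 1, 1],
    "price":     [221900, 538000, 180000, 604000],
})

# Mark the binary columns as 'category' dtype so plots and groupbys
# treat them as discrete levels rather than continuous numbers.
for col in ["coast", "furnished"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Later, `pd.get_dummies` can convert such category columns into indicator variables in one step.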
Let's do some visual data analysis of the features
#let's first import the required libraries for the plots
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# size of plots to make it uniform throughout our analysis in the notebook
plotSizeX = 12
plotSizeY = 6
# let's boxplot all the numerical columns and see if there are any outliers
for i in house_df.iloc[:, 2:].columns:
    house_df.iloc[:, 1:].boxplot(column=i)
    plt.show()
We can see there are a lot of features with outliers, so we may need to treat them before building the model.
#cid - the same cid appears multiple times; it seems the data contains houses that were sold multiple times
cid_count=house_df.cid.value_counts()
cid_count[cid_count>1].shape
We have 176 properties that were sold more than once in the given data
#we will keep a backup copy of the dataframe before transforming it
#We will convert dayhours to 'month_year', as the sale month-year is relevant for analysis
house_dfr=house_df.copy()
house_df.dayhours=house_df.dayhours.str.replace('T000000', "")
house_df.dayhours=pd.to_datetime(house_df.dayhours,format='%Y%m%d')
house_df['month_year']=house_df['dayhours'].apply(lambda x: x.strftime('%B-%Y'))
house_df['month_year'].head()
We successfully converted dayhours feature to month_year for better analysis.
house_df['month_year'].value_counts()
We can see most houses were sold in the months of April and July.
house_df.groupby(['month_year'])['price'].agg('mean')
So the timeline of the sale data runs from May-2014 to May-2015, and April has the highest mean price.
house_df.price.describe()
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df['price'])
The price ranges from 75,000 to 7,700,000 and the distribution is right-skewed.
house_df['room_bed'].value_counts()
The value of 33 seems to be an outlier; we need to check the data point before treating it.
house_df[house_df['room_bed']==33]
We will delete this data point after bivariate analysis, as it looks like an outlier: the price is low for a 33-bedroom property.
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot(house_df.room_bed,color='green')
Most of the houses/properties have 3 or 4 bedrooms
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot(house_df.room_bath,color='green')
house_df['room_bath'].value_counts().sort_index()
The majority of the properties have between 1.0 and 2.5 bathrooms.
plt.figure(figsize=(plotSizeX, plotSizeY))
print("Skewness is :",house_df.room_bath.skew())
sns.distplot(house_df.room_bath)
#The distribution is close to normal, with only slight skew (see the skewness value printed above)
plt.figure(figsize=(plotSizeX, plotSizeY))
print("Skewness is :",house_df.living_measure.skew())
sns.distplot(house_df.living_measure)
house_df.living_measure.describe()
Data distribution tells us, living_measure is right-skewed.
#Let's plot the boxplot for living_measure
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.boxplot(house_df.living_measure)
There are many outliers in living measure. Need to review further to treat the same.
#checking the no. of data points with Living measure greater than 8000
house_df[house_df['living_measure']>8000]
We have only 9 properties/houses with living_measure above 8k, so we will treat these outliers.
#Data is skewed as visible from plot
plt.figure(figsize=(plotSizeX, plotSizeY))
print("Skewness is :",house_df.lot_measure.skew())
sns.boxplot(house_df.lot_measure)
house_df.lot_measure.describe()
#checking the no. of data points with Lot measure greater than 1250000
house_df[house_df['lot_measure']>1250000]
We have only 1 property with lot_measure above 1,250,000, so we need to treat it.
#let's see the ceil count for all the records
house_df.ceil.value_counts()
We can see, most houses have 1 floor
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot('ceil',data=house_df)
The above graph confirms the same: most properties have 1 or 2 floors.
#coast - most houses donot have waterfront view, very few are waterfront
house_df.coast.value_counts()
#sight - most houses have not been viewed (sight = 0)
house_df.sight.value_counts()
#condition - Overall most houses are rated as 3 and above for its condition overall
house_df.condition.value_counts()
#Quality - most properties have quality rating between 6 to 10
house_df.quality.value_counts()
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot('quality',data=house_df)
#checking the no. of data points with quality rating as 13
house_df[house_df['quality']==13]
There are only 13 properties with the highest quality rating.
#ceil_measure - its highly skewed
print("Skewness is :", house_df.ceil_measure.skew())
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.ceil_measure)
house_df.ceil_measure.describe()
sns.factorplot(x='ceil',y='ceil_measure',data=house_df, size = 4, aspect = 2)
There is no clear pattern in ceil vs ceil_measure.
The vertical lines at each point represent the interquartile range of values at that point.
#basement_measure
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.basement)
We can see 2 Gaussians, which tells us that some properties don't have basements and some do.
house_df[house_df.basement==0].shape
Almost 60% of the properties have no basement.
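The share of houses without a basement can be computed directly from the zero entries. A minimal sketch on a toy stand-in for `house_df['basement']`:

```python
import pandas as pd

# Toy stand-in for house_df['basement'] (square footage; 0 = no basement)
basement = pd.Series([0, 0, 400, 0, 910, 0, 0, 730, 0, 0])

# Share of properties whose basement measure is exactly zero
no_basement_share = (basement == 0).mean() * 100
print(f"{no_basement_share:.0f}% of properties have no basement")  # → 70% here
```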
#houses have zero measure of basement i.e. they donot have basements
#let's plot boxplot for properties which have basements only
house_df_base=house_df[house_df['basement']>0]
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.boxplot(house_df_base['basement'])
We can clearly see, there are outliers. We need to treat this before our model.
#checking the no. of data points with 'basement' greater than 4000
house_df[house_df['basement']>4000]
We have only 2 properties with more than 4,000 measure basement
#Distribution of houses having basement
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df_base.basement)
The distribution of basement size, for houses that have one, is right-skewed.
#house range from new to very old
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.yr_built)
The built years of the properties range from 1900 to 2014, and we can see an upward trend in construction over time.
house_df[house_df['yr_renovated']>0].shape
Only 914 out of 21,613 houses were renovated.
#yr_renovated - plot of houses which are renovated
house_df_reno=house_df[house_df['yr_renovated']>0]
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df_reno.yr_renovated)
Next, we will create an age column from the yr_built and yr_renovated columns.
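A sketch of how such an age column could be derived, taking the renovation year as the effective build year when present. The toy columns below (`sale_year`, `yr_built`, `yr_renovated`) stand in for `house_df`'s; `sale_year` would come from the parsed dayhours field, and the column names are illustrative:

```python
import pandas as pd

# Toy stand-in for the relevant house_df columns
df = pd.DataFrame({
    "sale_year":    [2014, 2015, 2014],
    "yr_built":     [1960, 2005, 1930],
    "yr_renovated": [1991, 0,    0],   # 0 means never renovated
})

# Effective build year: the renovation year when the house was renovated,
# otherwise the original construction year.
effective_year = df["yr_renovated"].where(df["yr_renovated"] > 0, df["yr_built"])
df["age"] = df["sale_year"] - effective_year
print(df["age"].tolist())  # → [23, 10, 84]
```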
#For geographic visual
import geopandas as gpd
from shapely.geometry import Point, Polygon
#For current working directory
import os
cwd = os.getcwd()
## Need to add file USA ZipCodes_1.xlsx to current working directory to access this data
USAZip=pd.read_excel("USA ZipCodes_1.xlsx",sheet_name="Sheet8")
USAZip.head()
house_df=house_df.merge(USAZip,how='left',on='zipcode')
#house_df.drop_duplicates()
#let's see the shape of our dataframe
house_df.shape
Now we have 27 features
#Add the folder WA to your current working directory
usa = gpd.read_file(cwd+'\\WA\\WSDOT__City_Limits.shp')
usa.head()
gdf = gpd.GeoDataFrame(
    house_df, geometry=[Point(xy) for xy in zip(house_df['long'], house_df['lat'])])
#We can now plot our ``GeoDataFrame``
ax=usa[usa.CityName.isin(house_df.City.unique())].plot(
    color='white', edgecolor='black', figsize=(20,8))
plt.figure(figsize=(15,15))
gdf.plot(ax=ax, color='green', marker='o',markersize=0.1)
#let's see the columns of dataframe once again
house_df.columns
So we have 'City', 'County', 'Type' as new features in our dataframe.
house_df.Type.value_counts()
As the type is the same for all records, we will remove this column in further analysis.
house_df.City.value_counts()
So we have the most properties in the city of 'Seattle' and the fewest in 'Medina'.
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot('furnished',data=house_df)
house_df.furnished.value_counts()
Most properties are not furnished. The furnished column needs to be converted into a categorical column.
# let's plot all the variables and confirm our above deduction with more confidence
sns.pairplot(house_df, diag_kind = 'kde')
From the above pair plot, we observed/deduced the below.
In brief, the below features should be converted to categorical variables:
ceil, coast, sight, condition, quality, yr_renovated, furnished
And the below columns can be dropped after checking their Pearson correlation:
zipcode, lat, long, living_measure15, lot_measure15, total_area
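The Pearson check mentioned above can be sketched mechanically: compute each candidate column's correlation with price and flag the weakly correlated ones as drop candidates. The frame below is synthetic (one deliberately correlated and one unrelated column, with an illustrative 0.1 threshold); in the notebook the same check would run on `house_df` itself:

```python
import numpy as np
import pandas as pd

# Synthetic frame: one column built to correlate with price, one not.
rng = np.random.default_rng(0)
n = 2000
price = rng.normal(500000, 100000, n)
df = pd.DataFrame({
    "price": price,
    "living_measure15": price * 0.003 + rng.normal(0, 50, n),  # correlated
    "zipcode": rng.integers(98001, 98200, n),                  # unrelated to price
})

# Pearson correlation of every candidate column with price;
# columns below the (illustrative) 0.1 threshold are drop candidates.
corr = df.corr(method="pearson")["price"].drop("price")
weak = corr[corr.abs() < 0.1].index.tolist()
print("Weakly correlated drop candidates:", weak)
```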
# let's see the correlation between the different features
house_corr = house_df.corr(method ='pearson')
house_corr
The correlation matrix above shows linear relationships between several features.
We can plot a heatmap to confirm these findings more easily.
# Plotting heatmap
plt.subplots(figsize =(15, 8))
sns.heatmap(house_corr,cmap="YlGnBu",annot=True)
#month,year in which the house was sold. Price is not strongly influenced by it, though outliers can easily be seen.
house_df['month_year'] = pd.to_datetime(house_df['month_year'], format='%B-%Y')
house_df.sort_values(["month_year"], axis=0,
ascending=True, inplace=True)
house_df["month_year"] = house_df["month_year"].dt.strftime('%B-%Y')
sns.factorplot(x='month_year',y='price',data=house_df, size=4, aspect=2)
plt.xticks(rotation=90)
#groupby
house_df.groupby('month_year')['price'].agg(['mean','median','size'])
The mean price of the houses tends to be higher during March, April, and May compared to the September-December period.
#room_bed - outliers can be seen easily. The mean and median price increase with the number of bedrooms up to a point
#and then drop
sns.factorplot(x='room_bed',y='price',data=house_df, size=4, aspect=2)
#groupby
house_df.groupby('room_bed')['price'].agg(['mean','median','size'])
There is clear increasing trend in price with room_bed
#room_bath - outliers can be seen easily. Overall, mean and median price increase with room_bath
sns.factorplot(x='room_bath',y='price',data=house_df,size=4, aspect=2)
plt.xticks(rotation=90)
#groupby
house_df.groupby('room_bath')['price'].agg(['mean','median','size'])
There is upward trend in price with increase in room_bath
#living_measure - price increases with increase in living measure
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price']))
house_df['living_measure'].describe()
There is a clear increase in the price of a property with increasing living measure, but there seems to be one outlier to this trend. Need to evaluate it.
#lot_measure - there seems to be no relation between lot_measure and price
#the value range is very large, so we break it up to get a better view
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['lot_measure'],house_df['price']))
house_df['lot_measure'].describe()
There doesn't seem to be any relation between lot_measure and the price trend.
#lot_measure <25000
plt.figure(figsize=(plotSizeX, plotSizeY))
x=house_df[house_df['lot_measure']<25000]
print(sns.scatterplot(x['lot_measure'],x['price']))
x['lot_measure'].describe()
Almost 95% of the houses have lot_measure < 25000, but there is no clear trend between lot_measure and price.
#lot_measure <= 75000 - zooming in on the bulk of the data
plt.figure(figsize=(plotSizeX, plotSizeY))
y=house_df[house_df['lot_measure']<=75000]
print(sns.scatterplot(y['lot_measure'],y['price']))
#y['lot_measure'].describe()
#ceil - median price increases initially and then falls
print(sns.factorplot(x='ceil',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('ceil')['price'].agg(['mean','median','size'])
There is a slight upward trend in price with ceil.
#coast - the mean and median price of waterfront houses are higher; however, such houses are very few compared to non-waterfront
#Also, the living_measure mean and median are greater for waterfront houses.
print(sns.factorplot(x='coast',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('coast')['living_measure','price'].agg(['median','mean'])
Waterfront properties tend to have higher prices compared to non-waterfront properties.
#sight - has outliers. Houses viewed more often have higher prices (mean and median) and larger living areas as well.
print(sns.factorplot(x='sight',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('sight')['price','living_measure'].agg(['mean','median','size'])
Higher-priced properties have more viewings compared to lower-priced houses.
#Sight - Viewed in relation with price and living_measure
#Costlier houses with large living area are sighted more.
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['sight'],palette='Paired',legend='full'))
The above graph also confirms that higher-priced properties are viewed more than lower-priced houses.
#condition - as the condition rating increases, the price and living measure (mean and median) also increase
print(sns.factorplot(x='condition',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('condition')['price','living_measure'].agg(['mean','median','size'])
The price of the house increases with condition rating of the house
#Condition - Viewed in relation with price and living_measure. Most houses are rated as 3 or more.
#We can see some outliers as well
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['condition'],palette='Paired',legend='full'))
So we found that smaller houses tend to be in better condition, and better-condition houses have higher prices.
#quality - with grade increase price and living_measure increase (mean and median)
print(sns.factorplot(x='quality',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('quality')['price','living_measure'].agg(['mean','median','size'])
There is clear increase in price of the house with higher rating on quality
#quality - Viewed in relation with price and living_measure. Most houses are graded as 6 or more.
#We can see some outliers as well
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['quality'],palette='coolwarm_r',
legend='full'))
#ceil_measure - price increases with increase in ceil measure
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['ceil_measure'],house_df['price']))
house_df['ceil_measure'].describe()
There is upward trend in price with ceil_measure
#basement - relation of basement size with price
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['basement'],house_df['price']))
house_df['basement'].describe()
We will create a categorical variable 'has_basement' distinguishing houses with and without a basement. This categorical variable will be used for further analysis.
#Binning Basement to analyse data
def create_basement_group(series):
    if series == 0:
        return "No"
    elif series > 0:
        return "Yes"
house_df['has_basement'] = house_df['basement'].apply(create_basement_group)
#basement - after binning, the data shows that houses with basements are costlier and have higher
#living measure (mean & median)
print(sns.factorplot(x='has_basement',y='price',data=house_df, size = 4, aspect = 2))
house_df.groupby('has_basement')['price','living_measure'].agg(['mean','median','size'])
Houses with a basement command better prices compared to houses without one.
#basement - have higher price & living measure
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['has_basement']))
#yr_built - outliers can be seen easily.
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['yr_built'],house_df['living_measure']))
#groupby
house_df.groupby('yr_built')['price'].agg(['mean','median','size'])
We will create a new variable, HouseLandRatio: the proportion of living area in the total area of the property. We will explore the trend of price against this ratio.
#HouseLandRatio - computing new variable as the ratio of living_measure/total_area
#Signifies the share of the land used for construction of the house
house_df["HouseLandRatio"]=np.round((house_df['living_measure']/house_df['total_area']),2)*100
house_df["HouseLandRatio"].head()
#yr_renovated -
plt.figure(figsize=(plotSizeX, plotSizeY))
x=house_df[house_df['yr_renovated']>0]
print(sns.scatterplot(x['yr_renovated'],x['price']))
#groupby
x.groupby('yr_renovated')['price'].agg(['mean','median','size'])
So most renovations happened after the 1980s. We will create a new categorical variable 'has_renovated' to categorize properties as renovated or non-renovated. For further analysis we will use this categorical variable.
#Let's try to group yr_renovated
#Binning yr_renovated to analyse the data
def create_renovated_group(series):
    if series == 0:
        return "No"
    elif series > 0:
        return "Yes"
house_df['has_renovated'] = house_df['yr_renovated'].apply(create_renovated_group)
#has_renovated - renovated houses have higher mean and median prices; however, this does not confirm whether renovation
#actually increased the prices or not.
#HouseLandRatio - renovated houses utilized more of the land area for construction
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['has_renovated']))
#groupby
house_df.groupby(['has_renovated'])['price','HouseLandRatio'].agg(['mean','median','size'])
Renovated properties have higher prices than others with the same living measure.
#pd.crosstab(house_df['yearbuilt_group'],house_df['has_renovated'])
#has_renovated - have higher price & living measure
plt.figure(figsize=(plotSizeX, plotSizeY))
x=house_df[house_df['yr_built']<2000]
print(sns.scatterplot(x['living_measure'],x['price'],hue=x['has_renovated']))
#furnished - Furnished has higher price value and has greater living_measure
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['furnished']))
#groupby
house_df.groupby('furnished')['price','living_measure','HouseLandRatio'].agg(['mean','median','size'])
Furnished houses have higher price than that of the Non-furnished houses
#City - outliers can be seen easily.
print(sns.factorplot(x='City',y='price',data=house_df, size = 4, aspect = 2))
plt.xticks(rotation=90)
#groupby
house_df.groupby('City')['price'].agg(['mean','median','size']).sort_values(by='median',ascending=False)
From the above graph, a few cities have higher average house prices compared to others. We need to further analyse why prices vary among cities.
#City mean price distribution with average
city_price=pd.DataFrame(house_df.groupby('City')['price'].agg(['mean','median','size']))
indx=city_price.index
overall_price_mean=np.mean(house_df['price'])
overall_price_median=np.median(house_df['price'])
fig, ax1 = plt.subplots(figsize=(plotSizeX, plotSizeY))
barlist=ax1.bar(city_price.index,city_price['mean'],color='gray')
plt.xticks(rotation=90)
ax1.axhline(overall_price_mean, color="red")
ax1.text(1.02, overall_price_mean, "{0:.2f}".format(round(overall_price_mean,2)), va='center', ha="left", bbox=dict(facecolor="w",alpha=0.5),
transform=ax1.get_yaxis_transform())
plt.title("Cities and Mean Price")
plt.show()
As we can see from the above graph, the cities above the red line have higher-than-average mean house prices.
#City median price distribution with average
fig, ax1 = plt.subplots(figsize=(plotSizeX, plotSizeY))
barlist=ax1.bar(city_price.index,city_price['median'],color='green')
plt.xticks(rotation=90)
ax1.axhline(overall_price_median, color="red")
ax1.text(1.02, overall_price_median, "{0:.2f}".format(round(overall_price_median,2)), va='center', ha="left", bbox=dict(facecolor="w",alpha=0.5),
transform=ax1.get_yaxis_transform())
plt.title("Cities and Median Price")
plt.show()
As we can see from the above graph, the cities above the red line have higher-than-average median house prices.
#let's make a copy of the dataframe before making any further changes
house_df_bdp=house_df.copy()
We have seen outliers for the columns room_bed (the 33-bedroom record), living_measure, lot_measure, ceil_measure and basement.
def outlier_treatment(datacolumn):
    # IQR-based fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
    Q1,Q3 = np.percentile(datacolumn, [25,75])
    IQR = Q3-Q1
    lower_range = Q1-(1.5 * IQR)
    upper_range = Q3+(1.5 * IQR)
    return lower_range,upper_range
Using the above function, let's get the lower-bound and upper-bound values.
lowerbound,upperbound = outlier_treatment(house_df.ceil_measure)
print(lowerbound,upperbound)
Let's check which records are considered outliers.
house_df[(house_df.ceil_measure < lowerbound) | (house_df.ceil_measure > upperbound)]
We got 611 records which are outliers
#dropping the record from the dataset
house_df.drop(house_df[ (house_df.ceil_measure > upperbound) | (house_df.ceil_measure < lowerbound) ].index, inplace=True)
house_df.shape
#ceil_measure
print("Skewness is :", house_df.ceil_measure.skew())
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.ceil_measure)
house_df.ceil_measure.describe()
After treating the outliers in ceil_measure, the dataset shrank by about 600 (~3%) data points, but the data is now nicely distributed.
lowerbound_base,upperbound_base = outlier_treatment(house_df.basement)
print(lowerbound_base,upperbound_base)
house_df[(house_df.basement < lowerbound_base) | (house_df.basement > upperbound_base)]
We got 408 records as outliers, let's drop these outliers
#dropping the record from the dataset
house_df.drop(house_df[ (house_df.basement > upperbound_base) | (house_df.basement < lowerbound_base) ].index, inplace=True)
house_df.shape
#basement_measure
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.basement)
After treating the outliers in basement, about 400 (~2%) data points were dropped. In total, about 5% of the data has been dropped after treating ceil_measure and basement.
#Let's see the boxplot now for basement
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.boxplot(house_df['basement'])
lowerbound_lim,upperbound_lim = outlier_treatment(house_df.living_measure)
print(lowerbound_lim,upperbound_lim)
house_df[(house_df.living_measure < lowerbound_lim) | (house_df.living_measure > upperbound_lim)]
We got 178 records as outliers. Let's treat them by dropping.
#dropping the record from the dataset
house_df.drop(house_df[ (house_df.living_measure > upperbound_lim) | (house_df.living_measure < lowerbound_lim) ].index, inplace=True)
#let's see the boxplot after dropping the outliers
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.boxplot(house_df['living_measure'])
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.living_measure)
By treating the outliers in living_measure we lost 178 more data points, and the distribution now looks close to normal.
# shape of the data after dropping outliers in living_measure
house_df.shape
lowerbound_lom,upperbound_lom = outlier_treatment(house_df.lot_measure)
print(lowerbound_lom,upperbound_lom)
house_df[(house_df.lot_measure < lowerbound_lom) | (house_df.lot_measure > upperbound_lom)]
We got 2155 records which are outliers. Let's drop these outlier records.
#dropping the record from the dataset
house_df.drop(house_df[ (house_df.lot_measure > upperbound_lom) | (house_df.lot_measure < lowerbound_lom) ].index, inplace=True)
#let's plot after treating outliers
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.boxplot(house_df['lot_measure'])
house_df.shape
In total, 2128 outlier data points were dropped from lot_measure. We are still going ahead with dropping them and will analyze later whether this impacts the dataset.
#As we know for room_bed = 33 was outlier from our earlier findings, let's see the record and drop it
house_df[house_df['room_bed']==33]
#dropping the record from the dataset
house_df.drop(house_df[ (house_df.room_bed == 33) ].index, inplace=True)
house_df.shape
#let's see the feature/columns and drop the unneccessary features
house_df.columns
We already have this information in other features, so we will drop the unwanted columns from the new copied dataframe instance: cid, dayhours, yr_renovated, zipcode, lat, long, County, Type
#Let's create another dataframe for modeling
df_model=house_df.copy()
#let's check the new copy of dataframe by printing first few records
df_model.head()
New instance of dataframe for model created successfully
#let's verify the columns
df_model.columns
#Dropping the feature not required in 1st Iteration
df_final=df_model.drop(['cid','dayhours','yr_renovated','zipcode','lat','long','County','Type'],axis=1)
df_final.shape
df_final.head()
df_final.columns
# Getting dummies for the categorical columns (room_bed, room_bath, ceil, coast, sight, condition, quality, furnished, City, has_basement, has_renovated)
dff = pd.get_dummies(df_final, columns=['room_bed', 'room_bath', 'ceil', 'coast', 'sight', 'condition', 'quality', 'furnished','City',
'has_basement', 'has_renovated'],drop_first=True)
# let's see the data types of the features
dff.shape
dff.columns
dff.head()
#let's drop the month_year column as we already analyzed it
dff=dff.drop(['month_year'],axis=1)
#Creating X, y for training and testing set
X = dff.drop("price" , axis=1)
y = dff["price"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)
dff.head()
Let's build the model and see their performances
#importing the necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
LR1 = LinearRegression()
LR1.fit(X_train, y_train)
#predicting on the training and validation data
y_LR1_predtr= LR1.predict(X_train)
y_LR1_predvl= LR1.predict(X_val)
LR1.coef_
#Model score and Deduction for each Model in a DataFrame
LR1_trscore=r2_score(y_train,y_LR1_predtr)
LR1_trRMSE=np.sqrt(mean_squared_error(y_train, y_LR1_predtr))
LR1_trMSE=mean_squared_error(y_train, y_LR1_predtr)
LR1_trMAE=mean_absolute_error(y_train, y_LR1_predtr)
LR1_vlscore=r2_score(y_val,y_LR1_predvl)
LR1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_LR1_predvl))
LR1_vlMSE=mean_squared_error(y_val, y_LR1_predvl)
LR1_vlMAE=mean_absolute_error(y_val, y_LR1_predvl)
Compa_df=pd.DataFrame({'Method':['Linear Reg Model1'],'Val Score':LR1_vlscore,'RMSE_vl': LR1_vlRMSE, 'MSE_vl': LR1_vlMSE, 'MAE_vl': LR1_vlMAE,'train Score':LR1_trscore,'RMSE_tr': LR1_trRMSE, 'MSE_tr': LR1_trMSE, 'MAE_tr': LR1_trMAE})
#Compa_df = Compa_df[['Method', 'Test Score', 'RMSE', 'MSE', 'MAE']]
Compa_df
The linear regression model scored 0.73 and 0.72 on the training and validation sets respectively.
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
    sns.jointplot(x=y_val, y=y_LR1_predvl, kind="reg", color="k")
Lasso1 = Lasso(alpha=1)
Lasso1.fit(X_train, y_train)
#predicting on the training and validation data
y_Lasso1_predtr= Lasso1.predict(X_train)
y_Lasso1_predvl= Lasso1.predict(X_val)
Lasso1.coef_
#Model score and Deduction for each Model in a DataFrame
Lasso1_trscore=r2_score(y_train,y_Lasso1_predtr)
Lasso1_trRMSE=np.sqrt(mean_squared_error(y_train, y_Lasso1_predtr))
Lasso1_trMSE=mean_squared_error(y_train, y_Lasso1_predtr)
Lasso1_trMAE=mean_absolute_error(y_train, y_Lasso1_predtr)
Lasso1_vlscore=r2_score(y_val,y_Lasso1_predvl)
Lasso1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_Lasso1_predvl))
Lasso1_vlMSE=mean_squared_error(y_val, y_Lasso1_predvl)
Lasso1_vlMAE=mean_absolute_error(y_val, y_Lasso1_predvl)
Lasso1_df=pd.DataFrame({'Method':['Linear-Reg Lasso1'],'Val Score':Lasso1_vlscore,'RMSE_vl': Lasso1_vlRMSE, 'MSE_vl': Lasso1_vlMSE, 'MAE_vl': Lasso1_vlMAE,'train Score':Lasso1_trscore,'RMSE_tr': Lasso1_trRMSE, 'MSE_tr': Lasso1_trMSE, 'MAE_tr': Lasso1_trMAE})
Compa_df = pd.concat([Compa_df, Lasso1_df])
Compa_df
The Lasso regression model achieved R2 scores of 0.73 on the training set and 0.72 on the validation set. The coefficient of one variable in the Lasso model is almost 0, signifying that this variable can be dropped.
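The droppable variable can be identified programmatically by thresholding the fitted coefficients; a self-contained sketch on synthetic data (names and values are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic example: the second feature is pure noise, so Lasso with a
# sufficiently large alpha should shrink its coefficient exactly to 0.
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 2)
y_demo = 5.0 * X_demo[:, 0] + 0.01 * rng.randn(200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
dropped = [i for i, c in enumerate(lasso.coef_) if abs(c) < 1e-6]
print("coefficients:", lasso.coef_, "droppable feature indices:", dropped)
```

On the notebook's data the same comprehension over `Lasso1.coef_` with `X_train.columns` would name the droppable column directly.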
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
sns.jointplot(x=y_val, y=y_Lasso1_predvl, kind="reg", color="k")
Ridge1 = Ridge(alpha=0.5)
Ridge1.fit(X_train, y_train)
#predicting results on the training and validation data
y_Ridge1_predtr= Ridge1.predict(X_train)
y_Ridge1_predvl= Ridge1.predict(X_val)
Ridge1.coef_
#Model score and Deduction for each Model in a DataFrame
Ridge1_trscore=r2_score(y_train,y_Ridge1_predtr)
Ridge1_trRMSE=np.sqrt(mean_squared_error(y_train, y_Ridge1_predtr))
Ridge1_trMSE=mean_squared_error(y_train, y_Ridge1_predtr)
Ridge1_trMAE=mean_absolute_error(y_train, y_Ridge1_predtr)
Ridge1_vlscore=r2_score(y_val,y_Ridge1_predvl)
Ridge1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_Ridge1_predvl))
Ridge1_vlMSE=mean_squared_error(y_val, y_Ridge1_predvl)
Ridge1_vlMAE=mean_absolute_error(y_val, y_Ridge1_predvl)
Ridge1_df=pd.DataFrame({'Method':['Linear-Reg Ridge1'],'Val Score':Ridge1_vlscore,'RMSE_vl': Ridge1_vlRMSE, 'MSE_vl': Ridge1_vlMSE, 'MAE_vl': Ridge1_vlMAE,'train Score':Ridge1_trscore,'RMSE_tr': Ridge1_trRMSE, 'MSE_tr': Ridge1_trMSE, 'MAE_tr': Ridge1_trMAE})
Compa_df = pd.concat([Compa_df, Ridge1_df])
Compa_df
The Ridge regression model achieved R2 scores of 0.73 on the training set and 0.72 on the validation set. The coefficients of all variables in the ridge model are non-zero, indicating that none of the variables can be dropped.
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
sns.jointplot(x=y_val, y=y_Ridge1_predvl, kind="reg", color="k")
In summary, the regularized and non-regularized linear models performed almost identically
from sklearn.neighbors import KNeighborsRegressor
knn1 = KNeighborsRegressor(n_neighbors=4,weights='distance')
knn1.fit(X_train, y_train)
#predicting results on the training and validation data
y_knn1_predtr= knn1.predict(X_train)
y_knn1_predvl= knn1.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
knn1_trscore=r2_score(y_train,y_knn1_predtr)
knn1_trRMSE=np.sqrt(mean_squared_error(y_train, y_knn1_predtr))
knn1_trMSE=mean_squared_error(y_train, y_knn1_predtr)
knn1_trMAE=mean_absolute_error(y_train, y_knn1_predtr)
knn1_vlscore=r2_score(y_val,y_knn1_predvl)
knn1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_knn1_predvl))
knn1_vlMSE=mean_squared_error(y_val, y_knn1_predvl)
knn1_vlMAE=mean_absolute_error(y_val, y_knn1_predvl)
knn1_df=pd.DataFrame({'Method':['knn1'],'Val Score':knn1_vlscore,'RMSE_vl': knn1_vlRMSE, 'MSE_vl': knn1_vlMSE, 'MAE_vl': knn1_vlMAE,'train Score':knn1_trscore,'RMSE_tr': knn1_trRMSE, 'MSE_tr': knn1_trMSE, 'MAE_tr': knn1_trMAE})
Compa_df = pd.concat([Compa_df, knn1_df])
Compa_df
Though the KNN regressor performed well on the training set, its validation score is much lower, which shows that the model overfits the training data
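KNN is distance-based, so features on large scales (such as living area in square feet) dominate the distance unless the features are standardized. A hedged sketch (synthetic data and names, not the notebook's) of wrapping the regressor in a scaling pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Two features on wildly different scales (e.g. sqft vs. bathroom count)
rng = np.random.RandomState(0)
X_demo = np.column_stack([rng.uniform(500, 5000, 300),   # living area
                          rng.randint(1, 5, 300)])       # bathrooms
y_demo = 100 * X_demo[:, 0] + 50000 * X_demo[:, 1]

# StandardScaler equalizes the feature scales before the distance computation
knn_scaled = Pipeline([("scale", StandardScaler()),
                       ("knn", KNeighborsRegressor(n_neighbors=4, weights="distance"))])
knn_scaled.fit(X_demo, y_demo)
print("train R2:", knn_scaled.score(X_demo, y_demo))
```

In the notebook the same pipeline would be fitted on `X_train, y_train` and scored on the validation split to see whether scaling narrows the train/validation gap.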
from sklearn.svm import SVR
SVR1 = SVR(gamma='auto',C=10.0, epsilon=0.2,kernel='rbf')
SVR1.fit(X_train, y_train)
y_SVR1_predtr= SVR1.predict(X_train)
y_SVR1_predvl= SVR1.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
SVR1_trscore=r2_score(y_train,y_SVR1_predtr)
SVR1_trRMSE=np.sqrt(mean_squared_error(y_train, y_SVR1_predtr))
SVR1_trMSE=mean_squared_error(y_train, y_SVR1_predtr)
SVR1_trMAE=mean_absolute_error(y_train, y_SVR1_predtr)
SVR1_vlscore=r2_score(y_val,y_SVR1_predvl)
SVR1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_SVR1_predvl))
SVR1_vlMSE=mean_squared_error(y_val, y_SVR1_predvl)
SVR1_vlMAE=mean_absolute_error(y_val, y_SVR1_predvl)
SVR1_df=pd.DataFrame({'Method':['SVR1'],'Val Score':SVR1_vlscore,'RMSE_vl': SVR1_vlRMSE, 'MSE_vl': SVR1_vlMSE, 'MAE_vl': SVR1_vlMAE,'train Score':SVR1_trscore,'RMSE_tr': SVR1_trRMSE, 'MSE_tr': SVR1_trMSE, 'MAE_tr': SVR1_trMAE})
Compa_df = pd.concat([Compa_df, SVR1_df])
Compa_df
The negative scores of the SVR model indicate that it failed to learn from the training set, which in turn results in poor performance on the validation set
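One likely cause is feature scaling: with an RBF kernel, raw-scale features make most pairwise kernel values collapse toward zero, so the model cannot generalize. A hedged sketch on synthetic data (settings illustrative, not the notebook's) showing a StandardScaler pipeline helping SVR:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X_demo = rng.uniform(500, 5000, (300, 1))            # raw-scale feature (e.g. sqft)
y_demo = X_demo[:, 0] / 1000 + rng.randn(300) * 0.1  # modest-scale target

Xtr, Xts, ytr, yts = train_test_split(X_demo, y_demo, random_state=0)

raw = SVR(gamma="auto", C=10.0).fit(Xtr, ytr)        # unscaled: kernel degenerates
scaled = Pipeline([("sc", StandardScaler()),
                   ("svr", SVR(gamma="auto", C=10.0))]).fit(Xtr, ytr)
print("raw test R2:", raw.score(Xts, yts), " scaled test R2:", scaled.score(Xts, yts))
```

With real house prices the target is also on a large scale, so C and epsilon would need rescaling (or the target standardized) as well.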
SVR2 = SVR(gamma='auto',C=0.1,kernel='linear')
SVR2.fit(X_train, y_train)
y_SVR2_predtr= SVR2.predict(X_train)
y_SVR2_predvl= SVR2.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
SVR2_trscore=r2_score(y_train,y_SVR2_predtr)
SVR2_trRMSE=np.sqrt(mean_squared_error(y_train, y_SVR2_predtr))
SVR2_trMSE=mean_squared_error(y_train, y_SVR2_predtr)
SVR2_trMAE=mean_absolute_error(y_train, y_SVR2_predtr)
SVR2_vlscore=r2_score(y_val,y_SVR2_predvl)
SVR2_vlRMSE=np.sqrt(mean_squared_error(y_val, y_SVR2_predvl))
SVR2_vlMSE=mean_squared_error(y_val, y_SVR2_predvl)
SVR2_vlMAE=mean_absolute_error(y_val, y_SVR2_predvl)
SVR2_df=pd.DataFrame({'Method':['SVR2'],'Val Score':SVR2_vlscore,'RMSE_vl': SVR2_vlRMSE, 'MSE_vl': SVR2_vlMSE, 'MAE_vl': SVR2_vlMAE,'train Score':SVR2_trscore,'RMSE_tr': SVR2_trRMSE, 'MSE_tr': SVR2_trMSE, 'MAE_tr': SVR2_trMAE})
Compa_df = pd.concat([Compa_df, SVR2_df])
Compa_df
The SVR model with modified parameters has still not performed well, scoring only about 0.45 on both the training and validation sets
from sklearn.tree import DecisionTreeRegressor
DT1 = DecisionTreeRegressor()
DT1.fit(X_train, y_train)
y_DT1_predtr= DT1.predict(X_train)
y_DT1_predvl= DT1.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
DT1_trscore=r2_score(y_train,y_DT1_predtr)
DT1_trRMSE=np.sqrt(mean_squared_error(y_train, y_DT1_predtr))
DT1_trMSE=mean_squared_error(y_train, y_DT1_predtr)
DT1_trMAE=mean_absolute_error(y_train, y_DT1_predtr)
DT1_vlscore=r2_score(y_val,y_DT1_predvl)
DT1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_DT1_predvl))
DT1_vlMSE=mean_squared_error(y_val, y_DT1_predvl)
DT1_vlMAE=mean_absolute_error(y_val, y_DT1_predvl)
DT1_df=pd.DataFrame({'Method':['DT1'],'Val Score':DT1_vlscore,'RMSE_vl': DT1_vlRMSE, 'MSE_vl': DT1_vlMSE, 'MAE_vl': DT1_vlMAE,'train Score':DT1_trscore,'RMSE_tr': DT1_trRMSE, 'MSE_tr': DT1_trMSE, 'MAE_tr': DT1_trMAE})
Compa_df = pd.concat([Compa_df, DT1_df])
Compa_df
The initial decision tree model overfits the training set (score of 0.99) while performing poorly on the validation set
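This overfit/constrained contrast can be reproduced on synthetic data; the sketch below (illustrative settings, not the notebook's) shows an unconstrained tree memorizing the training set while a depth-capped tree cannot:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.randn(400, 3)
y_demo = 2 * X_demo[:, 0] + rng.randn(400) * 0.5   # signal plus noise

Xtr, Xvl, ytr, yvl = train_test_split(X_demo, y_demo, random_state=0)

# Unconstrained tree: grows until every training sample is in its own leaf
unpruned = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
# Constrained tree: limited depth and leaf size prevent memorization
pruned = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5,
                               random_state=0).fit(Xtr, ytr)
print("unpruned train/val:", unpruned.score(Xtr, ytr), unpruned.score(Xvl, yvl))
print("pruned   train/val:", pruned.score(Xtr, ytr), pruned.score(Xvl, yvl))
```

The constrained tree trades a perfect training score for leaves that average over noise, which is exactly the effect of `max_depth=10, min_samples_leaf=5` in the next cell.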
DT2 = DecisionTreeRegressor(max_depth=10,min_samples_leaf=5)
DT2.fit(X_train, y_train)
y_DT2_predtr= DT2.predict(X_train)
y_DT2_predvl= DT2.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
DT2_trscore=r2_score(y_train,y_DT2_predtr)
DT2_trRMSE=np.sqrt(mean_squared_error(y_train, y_DT2_predtr))
DT2_trMSE=mean_squared_error(y_train, y_DT2_predtr)
DT2_trMAE=mean_absolute_error(y_train, y_DT2_predtr)
DT2_vlscore=r2_score(y_val,y_DT2_predvl)
DT2_vlRMSE=np.sqrt(mean_squared_error(y_val, y_DT2_predvl))
DT2_vlMSE=mean_squared_error(y_val, y_DT2_predvl)
DT2_vlMAE=mean_absolute_error(y_val, y_DT2_predvl)
DT2_df=pd.DataFrame({'Method':['DT2'],'Val Score':DT2_vlscore,'RMSE_vl': DT2_vlRMSE, 'MSE_vl': DT2_vlMSE, 'MAE_vl': DT2_vlMAE,'train Score':DT2_trscore,'RMSE_tr': DT2_trRMSE, 'MSE_tr': DT2_trMSE, 'MAE_tr': DT2_trMAE})
Compa_df = pd.concat([Compa_df, DT2_df])
Compa_df
The decision tree with constrained parameters performs better on the training and validation sets than the initial tree, but overall the decision tree still does not outperform the linear regression models.
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
    sns.jointplot(x=y_val, y=y_DT2_predvl, kind="reg", color="k")
In summary, the KNN regressor and decision tree models have not performed as well as the linear regression models
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor
GB1=GradientBoostingRegressor(n_estimators = 200, learning_rate = 0.1, random_state=22)
GB1.fit(X_train, y_train)
y_GB1_predtr= GB1.predict(X_train)
y_GB1_predvl= GB1.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
GB1_trscore=r2_score(y_train,y_GB1_predtr)
GB1_trRMSE=np.sqrt(mean_squared_error(y_train, y_GB1_predtr))
GB1_trMSE=mean_squared_error(y_train, y_GB1_predtr)
GB1_trMAE=mean_absolute_error(y_train, y_GB1_predtr)
GB1_vlscore=r2_score(y_val,y_GB1_predvl)
GB1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_GB1_predvl))
GB1_vlMSE=mean_squared_error(y_val, y_GB1_predvl)
GB1_vlMAE=mean_absolute_error(y_val, y_GB1_predvl)
GB1_df=pd.DataFrame({'Method':['GB1'],'Val Score':GB1_vlscore,'RMSE_vl': GB1_vlRMSE, 'MSE_vl': GB1_vlMSE, 'MAE_vl': GB1_vlMAE,'train Score':GB1_trscore,'RMSE_tr': GB1_trRMSE, 'MSE_tr': GB1_trMSE, 'MAE_tr': GB1_trMAE})
Compa_df = pd.concat([Compa_df, GB1_df])
Compa_df
The gradient boosting model achieved good scores on both the training and validation sets
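One way to probe a fitted gradient boosting model further is `staged_predict`, which yields predictions after each boosting stage, so the validation score can be tracked stage by stage. A sketch on synthetic data (settings illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.randn(500, 4)
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.randn(500) * 0.3

Xtr, Xvl, ytr, yvl = train_test_split(X_demo, y_demo, random_state=0)
gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                               random_state=22).fit(Xtr, ytr)

# Validation R2 after each of the 200 boosting stages
val_r2 = [r2_score(yvl, pred) for pred in gb.staged_predict(Xvl)]
best_stage = int(np.argmax(val_r2)) + 1
print("best validation R2", max(val_r2), "at stage", best_stage)
```

Applied to `GB1` with `X_val, y_val`, this shows whether 200 estimators is past the point where validation error stops improving.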
BGG1=BaggingRegressor(n_estimators=50, oob_score= True,random_state=14)
BGG1.fit(X_train, y_train)
y_BGG1_predtr= BGG1.predict(X_train)
y_BGG1_predvl= BGG1.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
BGG1_trscore=r2_score(y_train,y_BGG1_predtr)
BGG1_trRMSE=np.sqrt(mean_squared_error(y_train, y_BGG1_predtr))
BGG1_trMSE=mean_squared_error(y_train, y_BGG1_predtr)
BGG1_trMAE=mean_absolute_error(y_train, y_BGG1_predtr)
BGG1_vlscore=r2_score(y_val,y_BGG1_predvl)
BGG1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_BGG1_predvl))
BGG1_vlMSE=mean_squared_error(y_val, y_BGG1_predvl)
BGG1_vlMAE=mean_absolute_error(y_val, y_BGG1_predvl)
BGG1_df=pd.DataFrame({'Method':['BGG1'],'Val Score':BGG1_vlscore,'RMSE_vl': BGG1_vlRMSE, 'MSE_vl':BGG1_vlMSE, 'MAE_vl': BGG1_vlMAE,'train Score':BGG1_trscore,'RMSE_tr': BGG1_trRMSE, 'MSE_tr': BGG1_trMSE, 'MAE_tr': BGG1_trMAE})
Compa_df = pd.concat([Compa_df, BGG1_df])
Compa_df
The bagging model also performed well on the training and validation sets, though the gap between them suggests some overfitting of the training set. We will analyse this further during hyperparameter tuning
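Since `oob_score=True` was passed, the fitted bagging model also exposes `oob_score_`, an out-of-bag estimate of generalization that needs no separate validation split. A small sketch on synthetic data (illustrative settings):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(14)
X_demo = rng.randn(300, 3)
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.randn(300) * 0.2

bag = BaggingRegressor(n_estimators=50, oob_score=True,
                       random_state=14).fit(X_demo, y_demo)
# Out-of-bag R2: each sample is predicted only by trees that did not see it
print("OOB R2:", bag.oob_score_)
```

In the notebook, `BGG1.oob_score_` gives the same kind of estimate and can be compared against the validation score to judge overfitting.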
from sklearn.ensemble import RandomForestRegressor
RF1=RandomForestRegressor()
RF1.fit(X_train, y_train)
y_RF1_predtr= RF1.predict(X_train)
y_RF1_predvl= RF1.predict(X_val)
#Model score and Deduction for each Model in a DataFrame
RF1_trscore=r2_score(y_train,y_RF1_predtr)
RF1_trRMSE=np.sqrt(mean_squared_error(y_train, y_RF1_predtr))
RF1_trMSE=mean_squared_error(y_train, y_RF1_predtr)
RF1_trMAE=mean_absolute_error(y_train, y_RF1_predtr)
RF1_vlscore=r2_score(y_val,y_RF1_predvl)
RF1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_RF1_predvl))
RF1_vlMSE=mean_squared_error(y_val, y_RF1_predvl)
RF1_vlMAE=mean_absolute_error(y_val, y_RF1_predvl)
RF1_df=pd.DataFrame({'Method':['RF1'],'Val Score':RF1_vlscore,'RMSE_vl': RF1_vlRMSE, 'MSE_vl':RF1_vlMSE, 'MAE_vl': RF1_vlMAE,'train Score':RF1_trscore,'RMSE_tr': RF1_trRMSE, 'MSE_tr': RF1_trMSE, 'MAE_tr': RF1_trMAE})
Compa_df = pd.concat([Compa_df, RF1_df])
Compa_df
The random forest model has performed well on the training and validation sets. There is scope for further analysis of this model
#feature importance
rf_imp_feature_1 = pd.DataFrame(RF1.feature_importances_, columns=["Imp"], index=X_val.columns)
rf_imp_feature_1["Imp"] = rf_imp_feature_1["Imp"].round(5)
rf_imp_feature_1 = rf_imp_feature_1.sort_values(by="Imp", ascending=False)
rf_imp_feature_1[:30].plot.bar(figsize=(plotSizeX, plotSizeY))
#First 20 features have an importance of 90.5% and first 30 have importance of 95.15
print("First 20 feature importance:\t",(rf_imp_feature_1[:20].sum())*100)
print("First 30 feature importance:\t",(rf_imp_feature_1[:30].sum())*100)
Above are the top 30 features, which account for about 95% of the importance in the model. These will be analysed further during hyperparameter tuning of the models for better scores
Ensemble methods are performing better than the linear models. Among the ensemble models, the gradient boosting regressor gives the best R2 score. We identified the top 30 features, which explain about 95% of the importance in the random forest model. We will tune the hyperparameters to improve performance, and will further explore and evaluate these features while tuning the ensemble models
rf_imp_feature_1[:30]
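An alternative to a fixed top-30 cut-off is to keep features up to a cumulative-importance threshold. A sketch with hypothetical importance values standing in for `RF1.feature_importances_`:

```python
import pandas as pd

# Hypothetical importances (stand-ins for RF1.feature_importances_)
imp = pd.Series([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125],
                index=["f0", "f1", "f2", "f3", "f4", "f5"])

imp = imp.sort_values(ascending=False)
# Keep the smallest prefix of features whose importances sum to <= 95%
keep = imp[imp.cumsum() <= 0.95].index.tolist()
print("features covering 95% cumulative importance:", keep)
```

Applied to `rf_imp_feature_1["Imp"]`, this would select however many features the 95% threshold requires rather than exactly 30.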
from sklearn.pipeline import Pipeline
def result(model, pipe_model, X_train_set, y_train_set, X_val_set, y_val_set):
    pipe_model.fit(X_train_set, y_train_set)
    #predicting results on the training and validation sets
    y_train_predict = pipe_model.predict(X_train_set)
    y_val_predict = pipe_model.predict(X_val_set)
    trscore = r2_score(y_train_set, y_train_predict)
    trRMSE = np.sqrt(mean_squared_error(y_train_set, y_train_predict))
    trMSE = mean_squared_error(y_train_set, y_train_predict)
    trMAE = mean_absolute_error(y_train_set, y_train_predict)
    vlscore = r2_score(y_val_set, y_val_predict)
    vlRMSE = np.sqrt(mean_squared_error(y_val_set, y_val_predict))
    vlMSE = mean_squared_error(y_val_set, y_val_predict)
    vlMAE = mean_absolute_error(y_val_set, y_val_predict)
    result_df = pd.DataFrame({'Method': [model], 'val score': vlscore, 'RMSE_val': vlRMSE, 'MSE_val': vlMSE, 'MAE_val': vlMAE,
                              'train Score': trscore, 'RMSE_tr': trRMSE, 'MSE_tr': trMSE, 'MAE_tr': trMAE})
    return result_df
The function above fits the model and returns its R2, RMSE, MSE and MAE scores
#Creating empty dataframe to capture results
result_dff=pd.DataFrame()
pipe_LR = Pipeline([('LR', LinearRegression())])
result_dff=pd.concat([result_dff,result('LR',pipe_LR,X_train,y_train,X_val,y_val)])
pipe_knr = Pipeline([('KNNR', KNeighborsRegressor(n_neighbors=4,weights='distance'))])
result_dff=pd.concat([result_dff,result('KNNR',pipe_knr,X_train,y_train,X_val,y_val)])
pipe_DTR = Pipeline([('DTR', DecisionTreeRegressor())])
result_dff=pd.concat([result_dff,result('DTR',pipe_DTR,X_train,y_train,X_val,y_val)])
pipe_GBR = Pipeline([('GBR', GradientBoostingRegressor(n_estimators = 200, learning_rate = 0.1, random_state=22))])
result_dff=pd.concat([result_dff,result('GBR',pipe_GBR,X_train,y_train,X_val,y_val)])
pipe_BGR = Pipeline([('BGR', BaggingRegressor(n_estimators=50, oob_score= True,random_state=14))])
result_dff=pd.concat([result_dff,result('BGR',pipe_BGR,X_train,y_train,X_val,y_val)])
pipe_RFR = Pipeline([('RFR', RandomForestRegressor())])
result_dff=pd.concat([result_dff,result('RFR',pipe_RFR,X_train,y_train,X_val,y_val)])
result_dff
The pipeline-based steps above run all the models and compile their scores into the result_dff dataframe. These steps are much more concise than running each model individually and compiling the scores as earlier.
#Storing results of initial data set - dff
result_ds1=result_dff.copy()
result_ds1
Now we will explore the possibility of feature reduction using PCA
dff.shape
dff.columns
We will drop the price column, as it is the target variable
df_pca = dff.drop(['price'], axis = 1)
numerical_cols = df_pca.copy()
numerical_cols.shape
# Let's first transform the entire X (independent variable data) to zscores.
# We will create the PCA dimensions on this distribution.
from scipy.stats import zscore
# As PCA for Independent columns of Numerical types, let's pass numerical_cols (16 numerical features)
numerical_cols = numerical_cols.apply(zscore)
cov_matrix = np.cov(numerical_cols.T)
print('Covariance Matrix \n%s' % cov_matrix)
The closer a covariance value is to 1, the more strongly the corresponding features are related.
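Since the features were z-scored first, this covariance matrix is (up to the ddof convention) just the correlation matrix. A quick numeric check of that equivalence:

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(100, 3)
data[:, 1] += data[:, 0]          # make two columns correlated

# z-scores with population std (ddof=0), matching scipy.stats.zscore defaults
z = (data - data.mean(axis=0)) / data.std(axis=0)
cov_z = np.cov(z.T, ddof=0)
corr = np.corrcoef(data.T)
print(np.allclose(cov_z, corr))   # covariance of z-scores equals correlation
```

The notebook's `np.cov(numerical_cols.T)` defaults to `ddof=1`, so its diagonal is slightly above 1, but the interpretation is the same.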
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s' % eigenvectors)
print('\n Eigen Values \n%s' % eigenvalues)
# Let's Sort eigenvalues in descending order
# Make a set of (eigenvalue, eigenvector) pairs
eig_pairs = [(eigenvalues[index], eigenvectors[:,index]) for index in range(len(eigenvalues))]
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest eigenvalue
# (sort on the eigenvalue only, to avoid comparing the vector arrays on ties)
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
print(eig_pairs)
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eigenvalues))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eigenvalues))]
# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)
tot = sum(eigenvalues)
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)]
# an array of variance explained by each
# eigen vector... there will be 90 entries as there are 90 eigen vectors)
cum_var_exp = np.cumsum(var_explained)
# an array of cumulative variance. There will be 90 entries with 90 th entry cumulative reaching almost 100%
print(len(var_explained))
print((cum_var_exp))
From the cumulative-variance table above, we conclude that about 72 principal components account for 96% of the variance
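The same cut can be obtained directly from sklearn's PCA by passing a float as `n_components`, which keeps the minimal number of components reaching that explained-variance ratio. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 200 samples, 10 columns, but only ~3 independent directions of variance
base = rng.randn(200, 3)
X_demo = np.hstack([base, base @ rng.randn(3, 7) + 0.01 * rng.randn(200, 7)])

pca = PCA(n_components=0.96)   # keep components explaining >= 96% of variance
X_reduced = pca.fit_transform(X_demo)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

On the notebook's z-scored `numerical_cols`, `PCA(n_components=0.96)` would return roughly the 72 components found manually above.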
plt.figure(figsize=(plotSizeX, plotSizeY))
plt.bar(range(0,90), np.array(var_explained), alpha = 0.5, align='center', label='individual explained variance')
plt.step(range(0,90), np.array(cum_var_exp), where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()
Now we will revisit the ensemble models from our initial run to check feature selection using the feature importances of the individual models
#Building fuction to return the feature importances for the model
predictors = [x for x in dff.columns if x not in ['price']]
def modelfit(alg, dxtrain, dytrain, printFeatureImportance=True):
    #fit the model and compute the feature importances
    alg.fit(dxtrain, dytrain)
    alg_imp_feature_1 = pd.DataFrame(alg.feature_importances_, columns=["Imp"], index=predictors)
    alg_imp_feature_1["Imp"] = alg_imp_feature_1["Imp"].round(5)
    alg_imp_feature_1 = alg_imp_feature_1.sort_values(by="Imp", ascending=False)
    feat_30list = list(alg_imp_feature_1.index[:30])
    if printFeatureImportance:
        alg_imp_feature_1[:30].plot.bar(figsize=(plotSizeX, plotSizeY))
        print("First 25 feature importance:\t", (alg_imp_feature_1[:25].sum()) * 100)
        print("First 30 feature importance:\t", (alg_imp_feature_1[:30].sum()) * 100)
    return feat_30list
We will run the function above with the gradient boosting and random forest ensemble models
#Gradient boost model
modelfit(GB1,X_train,y_train)
The top 30 features cover about 98% of the importance in the gradient boosting model, which is very good coverage for roughly 30% of the variables
#Random Forest model
modelfit(RF1,X_train,y_train)
The top 30 features cover about 95% of the importance in the random forest model
Now we will extract the top 30 features from each of the above models
feat_list_GB1=modelfit(GB1,X_train,y_train, printFeatureImportance=False)
print(feat_list_GB1)
feat_list_RF1=modelfit(RF1,X_train,y_train, printFeatureImportance=False)
print(feat_list_RF1)
We will consolidate the features from the two lists above
Key_feat=list(set(feat_list_GB1).union(feat_list_RF1))
print(len(Key_feat))
print(Key_feat)
The union of the two lists gives 33 important features. We will freeze this list and build another dataframe from it (along with 'price')
dff33=dff[['price','basement', 'City_Bellevue', 'coast_1', 'HouseLandRatio', 'City_Seattle', 'quality_10', 'quality_9', 'ceil_measure', 'City_Renton', 'City_Redmond', 'City_Federal Way', 'City_Mercer Island', 'yr_built', 'living_measure15', 'living_measure', 'City_Maple Valley', 'sight_3', 'total_area', 'City_Kirkland', 'sight_4', 'quality_6', 'quality_7', 'City_Sammamish', 'quality_8', 'City_Kent', 'quality_12', 'lot_measure', 'condition_3', 'furnished_1', 'City_Issaquah', 'quality_11', 'City_Medina', 'lot_measure15']].copy()
dff33.shape
dff33.head()
X3 = dff33.drop("price" , axis=1)
y3 = dff33["price"]
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=10)
X3_train, X3_val, y3_train, y3_val = train_test_split(X3_train, y3_train, test_size=0.2, random_state=10)
print(X3_train.shape)
print(X3_test.shape)
print(X3_val.shape)
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
Since the gradient boosting model performed best, we will tune its hyperparameters to improve the score.
The following are the parameters we will tune for the gradient boosting model.
#'bootstrap' is not a GradientBoostingRegressor parameter, so it is excluded
param_grid = {
'loss':['ls','lad','huber'],
'max_depth': range(5,11,1),
'max_features': ['auto','sqrt'],
'learning_rate': [0.05,0.1,0.2,0.25],
'min_samples_leaf': [4,10,20],
'min_samples_split': [5,10,1000],
'n_estimators': [10,50,100,150,200],
'subsample':[0.8,1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
First, we will tune each parameter separately
param_grid1 = {'n_estimators': range(50,401,50)}
grid_search1 = GridSearchCV(estimator = GBR_test, param_grid = param_grid1,
cv = 3, n_jobs = 2, verbose = 1)
grid_search1.fit(X_train,y_train)
grid_search1.best_params_, grid_search1.best_score_
n_estimators of 400 is the best in the range 50 to 400. We will test the same up to 1000
param_grid2 = {'n_estimators': range(400,1001,200)}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_search2 = GridSearchCV(estimator = GBR_test, param_grid = param_grid2,
cv = 3, n_jobs = 2, verbose = 1)
grid_search2.fit(X_train,y_train)
grid_search2.cv_results_,grid_search2.best_params_, grid_search2.best_score_
param_grid2 = {'n_estimators': range(1000,2000,300)}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_search2 = GridSearchCV(estimator = GBR_test, param_grid = param_grid2,
cv = 5, n_jobs = 3, verbose = 1)
grid_search2.fit(X_train,y_train)
grid_search2.best_params_, grid_search2.best_score_
n_estimators of 1000 gives the best result across the tested ranges
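Instead of grid-searching `n_estimators` directly, GradientBoostingRegressor (sklearn >= 0.20) also supports early stopping via `n_iter_no_change` and `validation_fraction`, which picks the number of stages automatically. A sketch on synthetic data (settings illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X_demo = rng.randn(500, 4)
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.randn(500) * 0.3

# Holds out 20% of the data internally and stops adding stages once the
# held-out loss fails to improve for 10 consecutive rounds
gb = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                               validation_fraction=0.2, n_iter_no_change=10,
                               random_state=22)
gb.fit(X_demo, y_demo)
print("stages actually fitted:", gb.n_estimators_)
```

This can be much cheaper than repeated grid searches over n_estimators ranges, at the cost of sacrificing part of the training data to the internal validation split.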
param_grid3 = {
'learning_rate': [0.1,0.2],
'min_samples_leaf': [5,10,20],
'min_samples_split': [5,10,20],
'n_estimators': [500,1000],
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_search3 = GridSearchCV(estimator = GBR_test, param_grid = param_grid3,
cv = 5, n_jobs = 3, verbose = 1)
grid_search3.fit(X_train,y_train)
grid_search3.best_params_, grid_search3.best_score_
Among the combinations of the four parameters above, these values give the best result. We can see n_estimators of 1000 is best again. Now we will change the ranges of the other three parameters
param_grid4 = {
'learning_rate': [0.1,0.15],
'max_depth': [5,10],
'min_samples_leaf': [5,8],
'min_samples_split': [20,30],
'n_estimators': [1000],
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_search4 = GridSearchCV(estimator = GBR_test, param_grid = param_grid4,
cv = 5, n_jobs = 3, verbose = 1)
grid_search4.fit(X_train,y_train)
grid_search4.best_params_, grid_search4.best_score_
The score has now reduced compared to the earlier run
param_grid5 = {
'learning_rate': [0.1],
'max_depth': [5],
'min_samples_leaf': [8,10],
'min_samples_split': [30,40],
'n_estimators': [1000],
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_search5 = GridSearchCV(estimator = GBR_test, param_grid = param_grid5,
cv = 5, n_jobs = 2, verbose = 1)
grid_search5.fit(X_train,y_train)
grid_search5.best_params_, grid_search5.best_score_
The score above has improved relative to the earlier runs
param_grid6 = {
'learning_rate': [0.1],
'max_depth': [5],
'min_samples_leaf': [8],
'min_samples_split': [40,50],
'n_estimators': [1000],
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_search6 = GridSearchCV(estimator = GBR_test, param_grid = param_grid6,
cv = 5, n_jobs = 2, verbose = 1)
grid_search6.fit(X_train,y_train)
grid_search6.best_params_, grid_search6.best_score_
There is a very marginal improvement in the score. We get the best score at min_samples_split of 40 among 30, 40 and 50.
param_grid7 = {
'loss':['ls','lad','huber'],
'max_features': ['auto','sqrt'],
'learning_rate': [0.1],
'max_depth': [5],
'min_samples_leaf': [8],
'min_samples_split': [40],
'n_estimators': [1000],
'subsample':[0.8,1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_search7 = GridSearchCV(estimator = GBR_test, param_grid = param_grid7,
cv = 5, n_jobs = 2, verbose = 1)
grid_search7.fit(X_train,y_train)
grid_search7.best_params_, grid_search7.best_score_
There is an improvement in the score. We will try one more iteration, changing the other parameters
param_gridF = {
'loss':['huber'],
'max_features': ['sqrt'],
'learning_rate': [0.1,0.2],
'max_depth': [5,8],
'min_samples_leaf': [5],
'min_samples_split': [40,50],
'n_estimators': [1000],
'subsample':[1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_searchF = GridSearchCV(estimator = GBR_test, param_grid = param_gridF,
cv = 5, n_jobs = 2, verbose = 1)
grid_searchF.fit(X_train,y_train)
grid_searchF.best_params_,grid_searchF.best_score_
Best parameters: learning_rate=0.1, loss='huber', max_depth=5, max_features='sqrt', min_samples_leaf=5, min_samples_split=50, n_estimators=1000, subsample=1
min_samples_leafs = range(1, 15, 1)
train_results = []
val_results = []
for min_samples_leaf in min_samples_leafs:
    GBR_test = GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=40,
        min_samples_leaf=min_samples_leaf,
        max_depth=5,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train, y_train)
    y_GBR_predtr = GBR_test.predict(X_train)
    y_GBR_predvl = GBR_test.predict(X_val)
    # r2_score expects (y_true, y_pred) in that order
    result_leafs_tr = r2_score(y_train, y_GBR_predtr)
    train_results.append(result_leafs_tr)
    result_leafs_vl = r2_score(y_val, y_GBR_predvl)
    val_results.append(result_leafs_vl)
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(min_samples_leafs,train_results,"b", label='Train r2')
line2, = plt.plot(min_samples_leafs, val_results,"r", label='val r2')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("r2 score")
plt.xlabel("min samples leaf")
plt.show()
From the plot above, min_samples_leaf of 6 gives the best validation score
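The manual sweep above can also be done with sklearn's `validation_curve`, which additionally cross-validates each candidate value. A small sketch on synthetic data (names and settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import validation_curve

rng = np.random.RandomState(0)
X_demo = rng.randn(300, 3)
y_demo = 2 * X_demo[:, 0] + rng.randn(300) * 0.3

leaf_sizes = [2, 5, 10]
# One row of CV scores per candidate min_samples_leaf value
train_scores, val_scores = validation_curve(
    GradientBoostingRegressor(n_estimators=50, random_state=22),
    X_demo, y_demo,
    param_name="min_samples_leaf", param_range=leaf_sizes,
    cv=3, scoring="r2")
print(val_scores.mean(axis=1))   # mean CV score per candidate value
```

Cross-validated scores are less sensitive to the particular train/validation split than the single-split curves plotted above.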
min_samples_splits = [10,15,30,50,100,500,700,1000]
train_results_spt = []
val_results_spt = []
for min_samples_split in min_samples_splits:
    GBR_test = GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=min_samples_split,
        min_samples_leaf=5,
        max_depth=5,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train, y_train)
    y_GBR_predtr = GBR_test.predict(X_train)
    y_GBR_predvl = GBR_test.predict(X_val)
    result_spt_tr = r2_score(y_train, y_GBR_predtr)
    train_results_spt.append(result_spt_tr)
    result_spt_vl = r2_score(y_val, y_GBR_predvl)
    val_results_spt.append(result_spt_vl)
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(min_samples_splits,train_results_spt,"b", label='Train R2')
line2, = plt.plot(min_samples_splits, val_results_spt,"r", label='Val R2')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("R2 score")
plt.xlabel("min samples split")
plt.show()
From the plot above, min_samples_split of about 10 gives the best score. We will expand the range around 10
min_samples_splits = [10,15,20,30,40,50,60,70,80,90,100]
train_results_spt = []
val_results_spt = []
for min_samples_split in min_samples_splits:
    GBR_test = GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=min_samples_split,
        min_samples_leaf=5,
        max_depth=5,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train, y_train)
    y_GBR_predtr = GBR_test.predict(X_train)
    y_GBR_predvl = GBR_test.predict(X_val)
    result_spt_tr = r2_score(y_train, y_GBR_predtr)
    train_results_spt.append(result_spt_tr)
    result_spt_vl = r2_score(y_val, y_GBR_predvl)
    val_results_spt.append(result_spt_vl)
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(min_samples_splits,train_results_spt,"b", label='Train R2')
line2, = plt.plot(min_samples_splits, val_results_spt,"r", label='Val R2')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("R2 score")
plt.xlabel("min samples split")
plt.show()
From the plot above, min_samples_split of about 10 gives the best score
min_samples_splits = [7,8,9,10,11,12,13,14,15,20]
train_results_spt = []
val_results_spt = []
for min_samples_split in min_samples_splits:
    GBR_test = GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=min_samples_split,
        min_samples_leaf=5,
        max_depth=5,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train, y_train)
    y_GBR_predtr = GBR_test.predict(X_train)
    y_GBR_predvl = GBR_test.predict(X_val)
    result_spt_tr = r2_score(y_train, y_GBR_predtr)
    train_results_spt.append(result_spt_tr)
    result_spt_vl = r2_score(y_val, y_GBR_predvl)
    val_results_spt.append(result_spt_vl)
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(min_samples_splits,train_results_spt,"b", label='Train R2')
line2, = plt.plot(min_samples_splits, val_results_spt,"r", label='Val R2')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("R2 score")
plt.xlabel("min samples split")
plt.show()
From the plot above, min_samples_split of about 12 gives the best score
max_depths = range(3,11,1)
train_results_dpt = []
val_results_dpt = []
for max_depth in max_depths:
    GBR_test = GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=10,
        min_samples_leaf=6,
        max_depth=max_depth,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train, y_train)
    y_GBR_predtr = GBR_test.predict(X_train)
    y_GBR_predvl = GBR_test.predict(X_val)
    result_dpt_tr = r2_score(y_train, y_GBR_predtr)
    train_results_dpt.append(result_dpt_tr)
    result_dpt_vl = r2_score(y_val, y_GBR_predvl)
    val_results_dpt.append(result_dpt_vl)
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(max_depths,train_results_dpt,"b", label='Train R2')
line2, = plt.plot(max_depths, val_results_dpt,"r", label='Val R2')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("R2 score")
plt.xlabel("max depth")
plt.show()
From the plot above, max_depth of about 6 gives the best validation score without overfitting the training set
estimators = range(100,1500,100)
train_results_est = []
val_results_est = []
for n_estimators in estimators:
    GBR_test = GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=n_estimators,
        subsample=1.0,
        min_samples_split=30,
        min_samples_leaf=6,
        max_depth=9,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train, y_train)
    y_GBR_predtr = GBR_test.predict(X_train)
    y_GBR_predvl = GBR_test.predict(X_val)
    result_est_tr = r2_score(y_train, y_GBR_predtr)
    train_results_est.append(result_est_tr)
    result_est_vl = r2_score(y_val, y_GBR_predvl)
    val_results_est.append(result_est_vl)
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(estimators,train_results_est,"b", label='Train R2')
line2, = plt.plot(estimators, val_results_est,"r", label='Val R2')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("R2 score")
plt.xlabel("n_estimators")
plt.show()
From the plot above, n_estimators of about 1000 gives the best score
param_gridF = {
'loss':['huber'],
'max_features': ['sqrt'],
'learning_rate': [0.1],
'max_depth': [6],
'min_samples_leaf': [6],
'min_samples_split': [12],
'n_estimators': [1000],
'subsample':[1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_searchF = GridSearchCV(estimator = GBR_test, param_grid = param_gridF,
cv = 5, n_jobs = 2, verbose = 1)
grid_searchF.fit(X_train,y_train)
grid_searchF.best_score_
param_gridF = {
'loss':['huber'],
'max_features': ['sqrt'],
'learning_rate': [0.1],
'max_depth': [5],
'min_samples_leaf': [5],
'min_samples_split': [50],
'n_estimators': [1000],
'subsample':[1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
grid_searchF = GridSearchCV(estimator = GBR_test, param_grid = param_gridF,
cv = 5, n_jobs = 2, verbose = 1)
grid_searchF.fit(X_train,y_train)
grid_searchF.best_score_,grid_searchF.best_params_
We can conclude that GridSearchCV gives better results than tuning individual parameters by the graphical method. Best parameters found:
'learning_rate': 0.1, 'loss': 'huber', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 50, 'n_estimators': 1000, 'subsample': 1
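The best parameters returned by GridSearchCV can be unpacked straight into a fresh estimator rather than retyped by hand. A minimal sketch (the `best_params` dict below is copied in for illustration; in the notebook it would come from `grid_searchF.best_params_`):

```python
from sklearn.ensemble import GradientBoostingRegressor

# illustrative stand-in for grid_searchF.best_params_
best_params = {'learning_rate': 0.1, 'loss': 'huber', 'max_depth': 5,
               'max_features': 'sqrt', 'min_samples_leaf': 5,
               'min_samples_split': 50, 'n_estimators': 1000, 'subsample': 1.0}

# ** unpacks the dict into keyword arguments of the constructor
GBR_bestparam = GradientBoostingRegressor(random_state=22, **best_params)
print(GBR_bestparam.get_params()['max_depth'])
```

This avoids transcription mistakes when carrying tuned values into the final model cell.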
GBR_bestparam=GradientBoostingRegressor(
loss='huber',
learning_rate=0.1,
n_estimators=1000,
subsample=1.0,
min_samples_split=50,
min_samples_leaf=5,
max_depth=5,
random_state=22,
alpha=0.9,
)
GBR_bestparam.fit(X_train,y_train)
y_GBRF_predtr= GBR_bestparam.predict(X_train)
y_GBRF_predvl= GBR_bestparam.predict(X_val)
y_GBRF_predts= GBR_bestparam.predict(X_test)
#Model score and Deduction for each Model in a DataFrame
GBRF_trscore=r2_score(y_train,y_GBRF_predtr)
GBRF_trRMSE=np.sqrt(mean_squared_error(y_train, y_GBRF_predtr))
GBRF_trMSE=mean_squared_error(y_train, y_GBRF_predtr)
GBRF_trMAE=mean_absolute_error(y_train, y_GBRF_predtr)
GBRF_vlscore=r2_score(y_val,y_GBRF_predvl)
GBRF_vlRMSE=np.sqrt(mean_squared_error(y_val, y_GBRF_predvl))
GBRF_vlMSE=mean_squared_error(y_val, y_GBRF_predvl)
GBRF_vlMAE=mean_absolute_error(y_val, y_GBRF_predvl)
GBRF_tsscore=r2_score(y_test,y_GBRF_predts)
GBRF_tsRMSE=np.sqrt(mean_squared_error(y_test, y_GBRF_predts))
GBRF_tsMSE=mean_squared_error(y_test, y_GBRF_predts)
GBRF_tsMAE=mean_absolute_error(y_test, y_GBRF_predts)
GBRF_df=pd.DataFrame({'Method':['GBRF'],'Val Score':GBRF_vlscore,'RMSE_vl': GBRF_vlRMSE, 'MSE_vl': GBRF_vlMSE,'train Score':GBRF_trscore,'RMSE_tr': GBRF_trRMSE, 'MSE_tr': GBRF_trMSE,'test Score':GBRF_tsscore,'RMSE_ts': GBRF_tsRMSE, 'MSE_ts': GBRF_tsMSE})
GBRF_df
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 50
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # shuffle=True is required when random_state is set in newer scikit-learn
results = cross_val_score(GBR_bestparam, X, y, cv=kfold)
print(results)
print("Mean R2: %.3f%% (std %.3f%%)" % (results.mean()*100.0, results.std()*100.0))
from matplotlib import pyplot
# plot scores
pyplot.hist(results)
pyplot.show()
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail region on each side: 2.5% for a 95% interval
lower = max(0.0, np.percentile(results, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(results, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
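The percentile computation above can be wrapped in a small reusable helper (the function name `score_confidence_interval` is ours, not a scikit-learn API):

```python
import numpy as np

def score_confidence_interval(scores, alpha=0.95):
    """Empirical percentile interval for a set of cross-validation scores."""
    tail = (1.0 - alpha) / 2.0 * 100           # e.g. 2.5 for alpha=0.95
    lower = np.percentile(scores, tail)
    upper = np.percentile(scores, 100 - tail)
    return lower, upper

# illustrative fold scores; with the real run, pass `results` from cross_val_score
scores = np.array([0.70, 0.72, 0.74, 0.75, 0.78, 0.80, 0.81, 0.83])
lo, hi = score_confidence_interval(scores)
print('95%% interval: %.3f to %.3f' % (lo, hi))
```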
import geopandas as gpd
from shapely.geometry import Point, Polygon
#For current working directory
import os
cwd = os.getcwd()
## Need to add file USA ZipCodes_1.xlsx to current working directory to access this data
USAZip=pd.read_excel("USA ZipCodes_1.xlsx",sheet_name="Sheet8")
USAZip.head()
house_df = pd.read_csv('innercity.csv')
house_df1=house_df.merge(USAZip,how='left',on='zipcode')
#house_df.drop_duplicates()
house_df.shape
#Add the folder WA to your current working directory
usa = gpd.read_file(os.path.join(cwd, 'WA', 'WSDOT__City_Limits.shp'))  # os.path.join keeps the path portable across OSes
usa.head()
gdf = gpd.GeoDataFrame(
house_df,geometry = [Point(xy) for xy in zip(house_df['long'], house_df['lat'])])
#We can now plot our ``GeoDataFrame``
ax=usa[usa.CityName.isin(house_df.City.unique())].plot(
color='white', edgecolor='black',figsize=(20,8))
plt.figure(figsize=(15,15))
gdf.plot(ax=ax, color='green', marker='o',markersize=0.1)
#After analysis in p1 - Dropping 'cid','dayhours','basement','yr_built','yr_renovated','zipcode','lat','long','County','Type',
#'geometry','quality_group','month_year' columns.
cols=['cid','dayhours']
house_df_1=house_df.drop(cols, inplace = False, axis = 1)
The datasets worked on earlier give an R2 score on the validation set in the range 70%-75%, with RMSE between 96,000 and 155,000. Trying a different dataset to see if this can be improved further.
In this iteration coast, furnished and quality are treated as categories, since in the previous version many features were transformed without getting the desired result.
Removing data points which fall into the criteria below:
We lose 20 records, which is 0.09% of the available data. These are extreme values for which we do not have enough data to estimate them well, hence they are removed.
house_df_2=house_df_1[(house_df_1['living_measure']<=9000) & (house_df_1['price']<=4000000) &
(house_df_1['room_bed']<=10) & (house_df_1['room_bath']<=6)]
house_df_2.shape
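The claimed 0.09% loss can be checked with a quick calculation. A sketch on a toy frame with the same columns (with the real data, use `house_df_1` and `house_df_2` from above):

```python
import pandas as pd

# toy stand-in for house_df_1; the middle row violates every threshold
house_df_1 = pd.DataFrame({'living_measure': [2000, 9500, 1500],
                           'price': [500000, 5000000, 300000],
                           'room_bed': [3, 12, 2],
                           'room_bath': [2, 7, 1]})
mask = ((house_df_1['living_measure'] <= 9000) & (house_df_1['price'] <= 4000000) &
        (house_df_1['room_bed'] <= 10) & (house_df_1['room_bath'] <= 6))
house_df_2 = house_df_1[mask]
dropped_pct = 100 * (len(house_df_1) - len(house_df_2)) / len(house_df_1)
print('dropped %.2f%% of records' % dropped_pct)
```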
house_df_2.columns
# Convert into dummies
house_df_final = pd.get_dummies(house_df_2, columns=['coast', 'quality', 'furnished'],drop_first=True)
house_df_final.columns
house_df_final.shape
#Final Data columns
house_df_final.columns
#total_area is highly correlated with lot_measure, ceil_measure is highly correlated with living_measure
house_corr_2 = house_df_final.corr(method ='pearson')
house_corr_2.to_excel("house_corr_2.xlsx")  # the legacy .xls writer is no longer supported by newer pandas
plt.figure(figsize=(35,20))
sns.heatmap(house_corr_2,cmap="coolwarm", annot=True,annot_kws={"size":9},fmt='.2')
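Instead of scanning the heatmap by eye, strongly correlated pairs (such as total_area with lot_measure) can be listed programmatically. A sketch shown on a toy frame; with the real data pass `house_df_final`:

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(method='pearson').abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):     # upper triangle only, no self-pairs
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(high_corr_pairs(toy))  # 'a' and 'b' are perfectly correlated
```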
#creating a copy of the final dataframe
dff2=house_df_final.copy()
df_train, df_test = train_test_split(dff2, test_size=0.2, random_state=10)
df_train, df_val = train_test_split(df_train, test_size=0.2, random_state=10)
print(df_train.shape)
print(df_test.shape)
print(df_val.shape)
# Split the 'df_train' set into X and y
X_train2 = df_train.drop(['price'],axis=1)
y_train2 = df_train['price']
len_train=len(X_train2)
X_train2.shape
y_train2.head()
# Split the 'df_val' set into X and y
X_val2 = df_val.drop(['price'],axis=1)
y_val2 = df_val['price']
len_val=len(X_val2)
X_val2.shape
y_val2.head()
# Split the 'df_test' set into X and y
X_test2 = df_test.drop(['price'],axis=1)
y_test2 = df_test['price']
X_test2.shape
len_test=len(X_test2)
y_test2.head()
#Creating empty dataframe to capture results
result_dff=pd.DataFrame()
#Function to give results of a model on its train and validation datasets.
#Inputs: model name to display, algorithm/pipeline, train independent variables, train dependent variable,
#validation independent variables, validation dependent variable.
def result(model, pipe_model, X_train_set, y_train_set, X_val_set, y_val_set):
    pipe_model.fit(X_train_set, y_train_set)
    #predicting results on the train and validation data
    y_train_predict = pipe_model.predict(X_train_set)
    y_val_predict = pipe_model.predict(X_val_set)
    trscore = r2_score(y_train_set, y_train_predict)
    trRMSE = np.sqrt(mean_squared_error(y_train_set, y_train_predict))
    trMSE = mean_squared_error(y_train_set, y_train_predict)
    trMAE = mean_absolute_error(y_train_set, y_train_predict)
    vlscore = r2_score(y_val_set, y_val_predict)   # use the passed-in validation set, not the global y_val
    vlRMSE = np.sqrt(mean_squared_error(y_val_set, y_val_predict))
    vlMSE = mean_squared_error(y_val_set, y_val_predict)
    vlMAE = mean_absolute_error(y_val_set, y_val_predict)
    result_df = pd.DataFrame({'Method': [model], 'val score': vlscore, 'RMSE_val': vlRMSE, 'MSE_val': vlMSE, 'MAE_vl': vlMAE,
                              'train Score': trscore, 'RMSE_tr': trRMSE, 'MSE_tr': trMSE, 'MAE_tr': trMAE})
    #Plot of actual vs predicted values
    plt.figure(figsize=(18, 10))
    sns.lineplot(x=np.arange(len(y_val_set)), y=y_val_set, color='blue', linewidth=1.5)
    sns.lineplot(x=np.arange(len(y_val_set)), y=y_val_predict, color='hotpink', linewidth=.5)
    plt.title('Actual and Predicted', fontsize=20) # Plot heading
    plt.xlabel('Index', fontsize=10) # X-label
    plt.ylabel('Values', fontsize=10) # Y-label
    return result_df
#Starting with RFE first as there are many features
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
clf=LinearRegression()
pipe_lr = Pipeline([('LR', clf)])
result_dff=pd.concat([result_dff,result('Linear Reg',pipe_lr,X_train,y_train,X_val,y_val)])
result_dff
#checking the magnitude of coefficients
predictors = X_train.columns
coef = pd.Series(clf.coef_,predictors).sort_values()
coef.plot(kind='bar', title='Model Coefficients',color='darkblue',figsize=(10,5))
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
clf=Ridge()
pipe_ridge = Pipeline([('Ridge', clf)])
result_dff=pd.concat([result_dff,result('Ridge_Reg_1',pipe_ridge,X_train,y_train,X_val,y_val)])
result_dff
#checking the magnitude of coefficients
predictors = X_train.columns
coef = pd.Series(clf.coef_,predictors).sort_values()
coef.plot(kind='bar', title='Model Coefficients',color='darkblue',figsize=(10,5))
#Iteration 2
clf=Ridge(alpha=0.08)
pipe_ridge_1 = Pipeline([('Ridge',clf )])
result_dff=pd.concat([result_dff,result('Ridge_Reg_2',pipe_ridge_1,X_train,y_train,X_val,y_val)])
result_dff
#checking the magnitude of coefficients
predictors = X_train.columns
coef = pd.Series(clf.coef_,predictors).sort_values()
coef.plot(kind='bar', title='Model Coefficients',color='darkblue',figsize=(10,5))
from sklearn.linear_model import Lasso
clf=Lasso(alpha=10, max_iter=1000)
pipe_lasso_1 = Pipeline([('Lasso',clf )])
result_dff=pd.concat([result_dff,result('Lasso_Reg_1',pipe_lasso_1,X_train,y_train,X_val,y_val)])
result_dff
#checking the magnitude of coefficients
predictors = X_train.columns
coef = pd.Series(clf.coef_,predictors).sort_values(ascending=False)
coef
from sklearn.neighbors import KNeighborsRegressor
pipe_knr = Pipeline([('KNNR', KNeighborsRegressor(n_neighbors=20,weights='distance'))])
result_dff=pd.concat([result_dff,result('KNN Reg',pipe_knr,X_train,y_train,X_val,y_val)])
result_dff
#The model is not performing well at all.
#from sklearn.svm import SVR
#from sklearn.preprocessing import StandardScaler
#pipe_svr_1 = Pipeline([('scl', StandardScaler()),('SVR_1', SVR(kernel='rbf'))])
#result_dff=pd.concat([result_dff,result('SVR_1',pipe_svr_1,X_train_rfe,y_train,X_val_rfe,y_val)])
#result_dff
#Feature importance function
def feat_imp(model, X_data_set):
    imp_feature_1 = pd.DataFrame(model.feature_importances_, columns=["Imp"], index=X_data_set.columns)
    imp_feature_1 = imp_feature_1.sort_values(by="Imp", ascending=False)
    print(imp_feature_1)
    #plot feature importance
    plt.figure(figsize=(10, 10))
    imp_feature_1[:30].plot.bar(figsize=(15, 5))
    #Cumulative importance of the top 8 and top 12 features
    print("\nTop 8 feature importance sum:\t", (imp_feature_1[:8].sum()) * 100)
    print("\nTop 12 feature importance sum:\t", (imp_feature_1[:12].sum()) * 100)
#Import library
from sklearn.tree import DecisionTreeRegressor
clf=DecisionTreeRegressor(random_state=1)
pipe_DT_1=Pipeline([('DT1',clf)])
result_dff=pd.concat([result_dff,result('DT1',pipe_DT_1,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
from sklearn.ensemble import RandomForestRegressor
clf=RandomForestRegressor(random_state=2)
pipe_RF_1=Pipeline([('RF1',clf)])
result_dff=pd.concat([result_dff,result('RF1',pipe_RF_1,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
clf=RandomForestRegressor(n_estimators=50,max_depth=18,min_samples_leaf=10,random_state=3)
pipe_RF_2=Pipeline([('RF2',clf)])
result_dff=pd.concat([result_dff,result('RF2',pipe_RF_2,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
from sklearn.ensemble import GradientBoostingRegressor
clf=GradientBoostingRegressor(random_state=4)
pipe_GB_1=Pipeline([('GB1',clf)])
result_dff=pd.concat([result_dff,result('GB1',pipe_GB_1,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
clf=GradientBoostingRegressor(n_estimators=150,max_depth=5,random_state=5)
pipe_GB_2=Pipeline([('GB2',clf)])
result_dff=pd.concat([result_dff,result('GB2',pipe_GB_2,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
from xgboost.sklearn import XGBRegressor
clf=XGBRegressor(objective='reg:squarederror',random_state=6)
pipe_XGB_1=Pipeline([('XGB1',clf)])
result_dff=pd.concat([result_dff,result('XGB1',pipe_XGB_1,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
clf=XGBRegressor(n_estimators=150,max_depth=5,random_state=7)
pipe_XGB_2=Pipeline([('XGB2',clf)])
result_dff=pd.concat([result_dff,result('XGB2',pipe_XGB_2,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
from sklearn.ensemble import AdaBoostRegressor
clf= AdaBoostRegressor(DecisionTreeRegressor(random_state=8))
pipe_ADAB_1=Pipeline([('ADAB1',clf)])
result_dff=pd.concat([result_dff,result('ADAB1',pipe_ADAB_1,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
clf= AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),n_estimators=250,learning_rate=0.005,random_state=9)
pipe_ADAB_2=Pipeline([('ADAB2',clf)])
result_dff=pd.concat([result_dff,result('ADAB2',pipe_ADAB_2,X_train,y_train,X_val,y_val)])
result_dff
#Feature importance
feat_imp(clf,X_train)
from sklearn.ensemble import BaggingRegressor
clf= BaggingRegressor(random_state=10)
pipe_BAG_1=Pipeline([('BAG1',clf)])
result_dff=pd.concat([result_dff,result('BAG1',pipe_BAG_1,X_train,y_train,X_val,y_val)])
result_dff
#Feature Importance
feature_importances = np.mean([ tree.feature_importances_ for tree in clf.estimators_], axis=0)
bg_imp_feature=pd.DataFrame(feature_importances, columns = ["Imp"],index=X_train.columns)
bg_imp_feature.sort_values(by="Imp",ascending=False)
clf= BaggingRegressor(DecisionTreeRegressor(max_depth=12),n_estimators=250,random_state=11)
pipe_BAG_2=Pipeline([('BAG2',clf)])
result_dff=pd.concat([result_dff,result('BAG2',pipe_BAG_2,X_train,y_train,X_val,y_val)])
result_dff
#Feature Importance
pd.options.display.float_format = '{:.5f}'.format
feature_importances = np.mean([ tree.feature_importances_ for tree in clf.estimators_], axis=0)
bg_imp_feature=pd.DataFrame(feature_importances, columns = ["Imp"],index=X_train.columns)
bg_imp_feature.sort_values(by="Imp",ascending=False)
We have used Linear Regression, Ridge, Lasso, KNN and ensemble techniques - Decision Trees, Random Forest, Bagging, AdaBoost, Gradient Boosting and XGBoost (gradient boosting with regularization, and faster). The R2 score on validation is in the range 70%-87%, with RMSE between 76,000 and 107,000. The models are showing better results. Let's hyper-tune to see if the results can be improved further, using Random Forest, Gradient Boosting, XGBoost and AdaBoost. Dropping the features whose importance is zero or very close to zero in all four of the above algorithms - quality_5, quality_3, quality_4.
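The collected `result_dff` can also be ranked directly in the notebook to pick the best model. A sketch on a toy results frame with the same column names (with the real run, sort `result_dff` itself):

```python
import pandas as pd

# toy stand-in for result_dff (real values come from the result() calls above)
result_dff = pd.DataFrame({'Method': ['Linear Reg', 'RF1', 'XGB1'],
                           'val score': [0.72, 0.86, 0.88],
                           'RMSE_val': [107000.0, 80000.0, 76000.0]})
ranked = result_dff.sort_values(by='RMSE_val').reset_index(drop=True)
print(ranked.loc[0, 'Method'])  # model with the lowest validation RMSE
```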
Kindly refer to the Excel sheet to compare the results.
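The near-zero importance features can also be identified programmatically instead of by inspection. A sketch using a toy importance Series (with the real models, build the Series from `clf.feature_importances_` indexed by `X_train.columns`):

```python
import pandas as pd

# toy importances indexed by feature name (illustrative values only)
imp = pd.Series({'living_measure': 0.45, 'furnished_1': 0.30,
                 'quality_8': 0.20, 'quality_3': 0.0005, 'quality_4': 0.0})
to_drop = imp[imp < 0.001].index.tolist()  # features with (near-)zero importance
print(to_drop)
```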
#Dropping features
X_train_ht=X_train.drop(['quality_5', 'quality_3', 'quality_4'], axis=1)
X_test_ht=X_test.drop(['quality_5', 'quality_3', 'quality_4'], axis=1)
X_val_ht=X_val.drop(['quality_5', 'quality_3', 'quality_4'], axis=1)
skf = KFold(n_splits=5, shuffle=True, random_state=12)  # shuffle=True is required when random_state is set in newer scikit-learn
#Tuning of Random Forest
RF_ht = RandomForestRegressor()
params = {"n_estimators": np.arange(76,84,1),"max_depth": np.arange(16,20,1),
"max_features":np.arange(6,9,1),'min_samples_leaf': range(5, 8, 1),
'min_samples_split': range(18, 20, 1)}
RF_GV_1 = GridSearchCV(estimator = RF_ht, param_grid = params,cv=skf,verbose=1,return_train_score=True,n_jobs=2)
RF_GV_1.fit(X_train_ht,y_train)
# results of grid search CV
RF_results = pd.DataFrame(RF_GV_1.cv_results_)
#parameters best value
best_score_rf = RF_GV_1.best_score_
best_rf = RF_GV_1.best_params_
best_rf
rf_best = RandomForestRegressor(max_depth= 18, max_features= 8,n_estimators=80,min_samples_leaf=5,min_samples_split=18,
random_state=14)
result_dff=pd.concat([result_dff,result('RF_ht',rf_best,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
#Feature importance
feat_imp(rf_best,X_train_ht)
GB_ht=GradientBoostingRegressor()
params = {"n_estimators": np.arange(138,142,1),"learning_rate":[0.08,0.09],"max_depth": np.arange(8, 11,1),
"max_features":np.arange(5,8,1),'min_samples_leaf': range(16, 21, 1)}
GB_GV_1 = GridSearchCV(estimator = GB_ht, param_grid = params,cv=skf,verbose=1,return_train_score=True,n_jobs=2)
GB_GV_1.fit(X_train_ht,y_train)
# results of grid search CV
GB_results = pd.DataFrame(GB_GV_1.cv_results_)
#parameters best value
best_score_rf = GB_GV_1.best_score_
best_gb = GB_GV_1.best_params_
best_gb
gb_best = GradientBoostingRegressor(learning_rate= 0.09, n_estimators= 150,max_depth= 10,
max_features= 7,min_samples_leaf=19)
result_dff=pd.concat([result_dff,result('GB_ht',gb_best,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
#Feature importance
feat_imp(gb_best,X_train_ht)
ADAB_ht=AdaBoostRegressor(DecisionTreeRegressor(max_depth=28))
params = {"n_estimators": np.arange(176,182,1),"learning_rate":[0.4,0.5,0.6],'loss':['linear','square']}
ADAB_GV_1 = GridSearchCV(estimator = ADAB_ht, param_grid = params,cv=skf,verbose=1,return_train_score=True,n_jobs=2)
ADAB_GV_1.fit(X_train_ht,y_train)
# results of grid search CV
ADAB_results = pd.DataFrame(ADAB_GV_1.cv_results_)
#parameters best value
best_score_rf = ADAB_GV_1.best_score_
best_adab = ADAB_GV_1.best_params_
best_adab
adab_best = AdaBoostRegressor(DecisionTreeRegressor(max_depth=28),n_estimators=180,learning_rate=0.5,loss='linear',
random_state=15)
result_dff=pd.concat([result_dff,result('ADAB_ht',adab_best,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
#Feature importance
feat_imp(adab_best,X_train_ht)
#Regularization using GridSearchCV - 1st Iteration
XGB_ht_1=XGBRegressor(objective='reg:squarederror')
params1 = {
"colsample_bytree": [i/100.0 for i in range(66,74,2)],
"learning_rate": [0.2,0.22,0.24],
"n_estimators": np.arange(185,188,1),
"subsample": [i/100.0 for i in range(62,68,1)]
}
XGB_GV_1 = GridSearchCV(estimator = XGB_ht_1, param_grid = params1,
cv=skf,
verbose = 1,
return_train_score=True,n_jobs=2)
XGB_GV_1.fit(X_train_ht,y_train)
# results of grid search CV
XGB_results_1 = pd.DataFrame(XGB_GV_1.cv_results_)
#parameters best value
best_score_xgb_1 = XGB_GV_1.best_score_
best_xgb_1 = XGB_GV_1.best_params_
best_xgb_1
#Choosing best parameter from 1st Iteration
xgb_best_1 = XGBRegressor(colsample_bytree=0.7,learning_rate=0.22,n_estimators=186,subsample=0.65,objective='reg:squarederror',
random_state=16)
result_dff=pd.concat([result_dff,result('xgb_1_ht',xgb_best_1,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
#Feature importance
feat_imp(xgb_best_1,X_train_ht)
#Regularization using GridSearchCV - 2nd Iteration
params2 = {
'min_child_weight':[6,7,8,9,10],"max_depth": [3,4,5],
}
xgb_best_2 = GridSearchCV(estimator = xgb_best_1, param_grid = params2,
cv=skf,
verbose = 1,
return_train_score=True,n_jobs=2)
xgb_best_2.fit(X_train_ht, y_train)
# results of grid search CV
XGB_results_2 = pd.DataFrame(xgb_best_2.cv_results_)
XGB_results_2
#parameters best value
best_score_xgb_2 = xgb_best_2.best_score_
best_xgb_2 = xgb_best_2.best_params_
best_xgb_2
#Choosing best parameter from 2nd Iteration
xgb_best_2 = XGBRegressor(colsample_bytree=0.7,learning_rate=0.22,n_estimators=186,subsample=0.65,objective='reg:squarederror',
random_state=17,max_depth=4,min_child_weight=8)
result_dff=pd.concat([result_dff,result('xgb_2_ht',xgb_best_2,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
#Feature importance
feat_imp(xgb_best_2,X_train_ht)
#Regularization using GridSearchCV - 3rd Iteration
params3 = {
'gamma':[i/1.0 for i in range(50,55,1)]
}
xgb_best_3 = GridSearchCV(estimator = xgb_best_2, param_grid = params3,
cv=skf,
verbose = 1,
return_train_score=True)
xgb_best_3.fit(X_train_ht, y_train)
# results of grid search CV
XGB_results_3 = pd.DataFrame(xgb_best_3.cv_results_)
XGB_results_3
#parameters best value
best_score_xgb_3 = xgb_best_3.best_score_
best_xgb_3 = xgb_best_3.best_params_
best_xgb_3
#Choosing best parameter from 3rd Iteration
xgb_best_3 = XGBRegressor(colsample_bytree=0.7,learning_rate=0.22,n_estimators=186,subsample=0.65,objective='reg:squarederror',
random_state=18,max_depth=4,min_child_weight=8,gamma=52)  # gamma was the parameter tuned in the 3rd iteration
result_dff=pd.concat([result_dff,result('xgb_3_ht',xgb_best_3,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
#Feature importance
feat_imp(xgb_best_3,X_train_ht)
We have executed many models, and after comparing the results we hyper-tuned four of them. All are performing well, with R2 scores above 86% and RMSE below 132,600.
But the best of all is XGBoost (Extreme Gradient Boosting), an enhanced version of gradient boosting that includes regularisation and is faster too. It gives an R2 score of around 89.5% with RMSE of around 109,000.
Going forward this model can be improved further, as we do not have much data for very high-priced houses. When more data comes in we can revisit the model and make the necessary changes to accommodate more variation in the data, and perhaps decrease the RMSE.
Finally, let's run our model on the test data, which we haven't used until now, and see how it performs.
result_dff=pd.concat([result_dff,result('xgb_test',xgb_best_3,X_train_ht,y_train,X_test_ht,y_test)])  # fit on train, evaluate on the held-out test set
result_dff
#Feature importance
feat_imp(xgb_best_3,X_test_ht)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 200
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # shuffle=True is required when random_state is set in newer scikit-learn
results = cross_val_score(xgb_best_3, X_test_ht, y_test, cv=kfold)
print(results)
print("Mean R2: %.3f%% (std %.3f%%)" % (results.mean()*100.0, results.std()*100.0))
from matplotlib import pyplot
# plot scores
pyplot.hist(results)
pyplot.show()
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail region on each side: 2.5% for a 95% interval
lower = max(0.0, np.percentile(results, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(results, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
print('Average R2 on test data is %.3f%%:' % (np.mean(results)*100))
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
sns.jointplot(x=y_val, y=xgb_best_3.predict(X_val_ht), kind="reg", color="k")
plt.title('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Actual', fontsize=10) # X-label
plt.ylabel('Predicted', fontsize=10)
plt.tight_layout()
Finally we have the result: our final selected model performs well on the test data, with an R2 score of around 87.0% and RMSE of around 120,000.
The most important feature for pricing is furnished: a furnished house is priced higher.
Some other important features that affect the price the most are living measure, latitude, above-average quality and a coastal location. So, one needs to thoroughly inspect the property against the parameters suggested and list its price accordingly; similarly, a buyer should check the features suggested above and calculate the predicted price, which can then be compared to the listed price.
We have built different models on two datasets. The performance (score and 95% confidence interval of scores) of the model built on dataset-1 is better than that of dataset-2, as its 95% confidence interval is much narrower. Even though the dataset-2 model scores higher, its performance scores span a very wide range.
The top key features to consider for pricing a property are: 'furnished_1', 'yr_built', 'living_measure', 'quality_8', 'lot_measure15', 'quality_9', 'ceil_measure', 'total_area'. These are almost the same in both models.
For further improvement, the datasets can be rebuilt by treating outliers in different ways and hyper-tuning the ensemble models. Creating polynomial features to improve model performance can also be explored.
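As one direction for the polynomial-features idea, scikit-learn's `PolynomialFeatures` can expand selected numeric columns with squares and interaction terms. A minimal sketch on toy data (the two columns stand in for features like living_measure and room_bath):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# toy numeric features for two houses
X = np.array([[2000.0, 2.0],
              [1500.0, 1.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2
print(X_poly.shape)  # (2, 5)
```

In practice this would only be applied to a few continuous columns, since expanding all 30+ features quadratically would inflate the feature space considerably.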
First we will define the function for the data preprocessing required before running the model. Then we will call it to predict the price (target) of a property.
The pickle file is created following the steps used for dataset-2.
#Defining a function to run all preprocessing steps as done for the model
def model(data):
    import pandas as pd
    import numpy as np
    X_test = pd.read_excel(data)
    #Removing outliers
    X_test_1 = X_test[(X_test['living_measure']<=9000) & (X_test['price']<=4000000) &
                      (X_test['room_bed']<=10) & (X_test['room_bath']<=6)]
    cols = ['cid','dayhours']
    X_test_1 = X_test_1.drop(cols, axis=1)  # drop from the outlier-filtered frame, not the raw one
    #categorical columns to be converted to dummies
    categ = ['coast', 'furnished', 'quality']
    X_test_final = X_test_1.copy()
    #create the dummy columns expected by the trained model, defaulting to 0
    for i in range(1, 2):
        X_test_final['coast_'+str(i)] = 0
        X_test_final['furnished_'+str(i)] = 0
    for i in range(1, 14):
        X_test_final['quality_'+str(i)] = 0
    #set the dummy matching each categorical value (assumes a single-row input)
    for i in range(1, 2):
        if (X_test_final['coast']==i).all():
            X_test_final['coast_'+str(i)] = 1
    for i in range(1, 2):
        if (X_test_final['furnished']==i).all():
            X_test_final['furnished_'+str(i)] = 1
    for i in range(1, 14):
        if (X_test_final['quality']==i).all():
            X_test_final['quality_'+str(i)] = 1
    X_test_final = X_test_final.drop(['quality_3', 'quality_4', 'quality_1', 'quality_2', 'quality_5', 'price'], axis=1)
    # Drop the original categorical columns
    X_test_final = X_test_final.drop(categ, axis=1)
    return X_test_final
import pickle
with open('model_pickle','wb') as f:
pickle.dump(xgb_best_3,f)
with open('model_pickle','rb') as f:
mp=pickle.load(f)
X_test=model('innercity.xlsx')
mp.predict(X_test)
#X_test.columns